ZpqrtBnk

Faster GitHub PR check time with multi-level matrix strategy

Posted on September 26, 2022 in dotnet

The Hazelcast .NET¹ Client is provided as a library that is supported on .NET (Framework) version 4.6.2 to 4.8, .NET (Core) 3.1, and .NET (Just .NET) 5.0 and 6.0. For .NET Core and Just .NET, the library is supported both on Windows and on Linux. This means that we want to run our complete test suite on both platforms, for each supported .NET version.

Originally, we defined a Build PR workflow that would trigger on each PR and build and run the tests for each .NET version supported by the platform, one after another. In order to test both platforms, we used an OS matrix strategy:

name: Build PR
on: pull_request

jobs:
  build-pr:
    name: Build PR (${{ matrix.os }})
    runs-on: ${{ matrix.os }}

    strategy:
      fail-fast: false
      matrix:
        os: [ ubuntu-latest, windows-latest ]

    steps:
    - name: Build and Test the PR
      (etc)

Thanks to this, on each PR, two jobs would run in parallel: one for Linux, and one for Windows. However, on each platform, the tests for each .NET version would run sequentially. So if the complete test suite takes about 20 minutes to run, the Windows job would take net462 + net48 + netcoreapp3.1 + net5.0 + net6.0 = 5 times 20 minutes, or 1h40. That is a long time to wait to validate a PR.

Entering the Second Dimension

We want to reduce that time, and the idea is to parallelize each .NET version, in addition to platforms. On each PR, we would fork one job for .NET 5.0 on Linux, one job for .NET 6.0 on Linux, one job for .NET 4.8 on Windows, etc. The total build time would be approximately the build time of one version or, in our example above, 20 minutes instead of 1h40.

GitHub supports two-dimensional matrices. For instance:

    matrix:
      os: [ ubuntu-latest, windows-latest ]
      dotnet: [ net5.0, net6.0 ]

In our situation, however, the list of versions to put in the dotnet dimension is not fixed and actually depends on the OS. We would need to write something akin to the following code, where get_dotnet_versions would be a function returning the .NET versions corresponding to the OS:

    matrix:
      os: [ ubuntu-latest, windows-latest ]
      dotnet: get_dotnet_versions(matrix.os)

Alas, GitHub does not support that type of construct. The only thing it supports is referencing a variable, such as an input, in the matrix. So we would need to be able to produce a single variable, and yet its value would depend on matrix.os... again, that type of construct is not supported. All in all, I could not find a way to get one matrix dimension to depend on the other, within one workflow.

Divide and Conquer

Now... what if, instead of trying to do everything in one workflow, we used two? Our first workflow looks like:

jobs:
  getversions:
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [ ubuntu-latest, windows-latest ]

    outputs:
      # beware! this cannot be dynamic = must match the OS matrix
      versions-ubuntu-latest: ${{ steps.getversions.outputs.versions-ubuntu-latest }}
      versions-windows-latest: ${{ steps.getversions.outputs.versions-windows-latest }}

    steps:
    - name: Determine DotNet Versions
      id: getversions
      shell: bash
      run: |
        VERSIONS=$(....)
        # write the step output through the GITHUB_OUTPUT file (the ::set-output command is deprecated)
        echo "versions-${{ matrix.os }}=$VERSIONS" >> "$GITHUB_OUTPUT"

  buildtest:
    name: Build&Test ${{ matrix.os }}
    needs: getversions
    strategy:
      fail-fast: false
      matrix:
        os: [ ubuntu-latest, windows-latest ]
    uses: ./.github/workflows/build-pr-on.yml
    secrets: inherit
    with:
      os: ${{ matrix.os }}
      versions: ${{ needs.getversions.outputs[format('versions-{0}', matrix.os)] }}

Let's go through this workflow. It contains two jobs, getversions and buildtest.

The first job, getversions, runs a single step that determines the versions based on the platform, and sets a job output variable accordingly. That variable contains a JSON list of versions, e.g. [ "net462", "net48" ]. Actually, it sets one job output per platform, and I could not find a way to avoid hard-coding the platform names.
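The VERSIONS=$(....) line is elided in the workflow above, and the snippet below does not claim to be what goes there. As a minimal sketch, assuming the lists are simply hard-coded from the support matrix described at the top of this post, a hypothetical versions_for_os helper could print the JSON list for a given runner label:

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the actual workflow): prints a compact JSON
# array of target frameworks for a runner label such as "ubuntu-latest".
# The lists are an assumption taken from the supported versions named above.
versions_for_os() {
  case "$1" in
    windows-*) echo '["net462","net48","netcoreapp3.1","net5.0","net6.0"]' ;;
    ubuntu-*)  echo '["netcoreapp3.1","net5.0","net6.0"]' ;;
    *)         echo "unsupported runner: $1" >&2; return 1 ;;
  esac
}

# In the workflow step, the argument would be ${{ matrix.os }}.
VERSIONS=$(versions_for_os "ubuntu-latest")
echo "$VERSIONS"   # prints ["netcoreapp3.1","net5.0","net6.0"]
```

The step would then hand $VERSIONS to the output command shown in the workflow. Keeping the JSON on a single line matters here, because the value travels as a plain string until the called workflow parses it.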

The second job, buildtest, does not run any steps, but instead triggers another workflow, passing two input parameters: the platform, and the list of versions. And this is where we manage to specify versions that depend on the platform.

And then, the magic happens in that other workflow. First, in order to be "callable" with input parameters, it requires this on statement:

on: 
  workflow_call:
    inputs:
      os:
        required: true
        type: string
      versions:
        required: true
        type: string

And then its job is defined as:

jobs:
  buildtest:
    name: ${{ matrix.version }}
    runs-on: ${{ inputs.os }}
    strategy:
      fail-fast: false
      matrix:
        version: ${{ fromJson(inputs.versions) }}

And there we have it: a build and test job that runs on the specified platform, for each specified version. The last piece of the puzzle consists in wrapping it all into one single conclusion point. This is done via an additional job in the first workflow:

  report:
    name: Build&Test Result
    runs-on: ubuntu-latest
    if: always()
    needs: buildtest
    steps:
    - name: report
      shell: bash
      run: |
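        # the needs context exposes the aggregated job result as "result"
        if [ "${{ needs.buildtest.result }}" == "success" ]; then
          echo "All Build&Test checks completed successfully."
        else
          echo "At least one Build&Test check has failed."
          echo "::error::At least one Build&Test check has failed."
          # the ::error:: command only annotates; a non-zero exit fails the job
          exit 1
        fi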

This job "needs" the buildtest job, in other words it will only run once all our forked build & test jobs have completed. It always runs, and succeeds or fails depending on the results of all the forked jobs. This means that this is what we see in GitHub Actions:

The block to the left shows the two jobs that run, one per platform, and figure out the framework versions for each platform. The center block shows each combination of platform / version build: they all run in parallel. Finally, the block to the right shows the global status of the build.

That last "Build&Test Result" block can be used as a required check in our GitHub branch configuration, thus ensuring that all test combinations must be successful before any PR can be merged.
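For that required check to mean anything, the report step has to end with a non-zero exit code whenever the needed job did not succeed. The sketch below isolates that logic in a hypothetical report_result function (the name is an assumption, not part of the workflow). Its argument stands in for the value of needs.buildtest.result, which GitHub sets to success, failure, cancelled or skipped:

```shell
#!/usr/bin/env bash

# Hypothetical helper mirroring the report step: returns 0 only when the
# aggregated result of the needed job is "success".
report_result() {
  local result="$1"   # in the workflow: "${{ needs.buildtest.result }}"
  if [ "$result" = "success" ]; then
    echo "All Build&Test checks completed successfully."
  else
    # ::error:: only adds an annotation to the run; the non-zero
    # return code is what actually marks the job as failed.
    echo "::error::At least one Build&Test check has failed."
    return 1
  fi
}

report_result "success"
```

Anything other than success, including cancelled and skipped, is treated as a failure here, which is the conservative choice for a job that gates merging.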

All in about 20 minutes, not 1h40!

¹ Or maybe it is the "Hazelcast dotnet Client" or "Hazelcast Dotnet Client", depending on the success of the #dropTheDot movement initiated by Khalid at a time when the .NET Drama Advisory (or should it be the Dotnet Drama Advisory) was pretty low and he got bored.
