Machine Learning Pipeline Cloud Resource Incident



Context:

We own a core business machine learning pipeline that runs weekly.

The pipeline generates customer-campaign-content matches across 10+ million customers and 200+ campaigns.

The pipeline also passes the final selected allocation to a downstream system, which sends out the campaigns.

Issue Summary:

The pipeline has to finish within 1 day due to data availability timing and the requirements of the downstream system.

The majority of the pipeline runs on GKE preemptible machines.

The parallel batch scoring step in the pipeline uses ~1,000 containers/pods, which requires ~66 machines with 64 vCPUs and 416 GB RAM each.

In one of our production runs, we noticed that only 1 machine was allocated, which would increase the runtime from ~20 minutes to 30+ hours and miss the deadline.
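The degradation can be sanity-checked with simple arithmetic. The pods-per-node figure below is an assumption inferred from the ~66-node fleet for ~1,000 pods; the serial estimate ignores scheduling overhead, which is why the observed number is 30+ hours rather than ~22.

```shell
# Back-of-envelope sketch of the blast radius (numbers from this incident).
PODS=1000
PODS_PER_NODE=15                        # assumption: ~15 pods per 64-vCPU / 416 GB node
NODES=$(( (PODS + PODS_PER_NODE - 1) / PODS_PER_NODE ))   # ceil division -> ~66-67 nodes
NORMAL_MIN=20                           # runtime with the full fleet
DEGRADED_MIN=$(( NORMAL_MIN * NODES ))  # one node does all the work serially
echo "need ~$NODES nodes; on 1 node: ~$(( DEGRADED_MIN / 60 ))+ hours"
```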

Root Cause and Remediation:

The root cause is a combination of:

  • use of preemptible machines
  • machine type availability
  • resource allocation based on machine type

Preemptible machines

We use preemptible machines to reduce cost during prediction. Preemptible VMs offer the same machine types and options as regular compute instances, but they last for at most 24 hours and can be reclaimed whenever Compute Engine needs the capacity.

When capacity was limited, only 1 machine remained available to our pipeline; the rest were reclaimed.

We switched the node pool to non-preemptible machines so that the pipeline keeps the resources it acquires as capacity becomes available, instead of having them reclaimed by other workloads in the same cloud zone.
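Recreating the pool as on-demand capacity is just a matter of omitting the `--preemptible` flag. A minimal sketch; the cluster, pool, and zone names are hypothetical, and the machine type is a stand-in for our actual shape:

```shell
# Create a non-preemptible (on-demand) replacement pool: no --preemptible flag.
# All names below are illustrative, not the real cluster/pool names.
gcloud container node-pools create scoring-pool-ondemand \
  --cluster=ml-pipeline-cluster \
  --zone=us-central1-a \
  --machine-type=n2d-highmem-64 \
  --enable-autoscaling --min-nodes=0 --max-nodes=66
```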

Machine type availability

There are 4 general-purpose machine families available in Google Cloud at the time of writing:
  • E2: max 32 vCPUs / 128 GB RAM; Intel or second-generation AMD EPYC Rome
  • N2: max 80 vCPUs / 640 GB RAM; Intel Cascade Lake
  • N2D: max 224 vCPUs / 1,772 GB RAM; AMD EPYC Rome
  • N1: max 96 vCPUs / 624 GB RAM; Intel Sandy Bridge, Ivy Bridge, Haswell, Broadwell, or Skylake

Due to cost and RAM requirements, the pipeline runs on the N2D machine type. However, the compute zone's N2D capacity may not be as abundant as N1 capacity.

We quickly spawned a new node pool based on the N1 type, and resources became readily available.
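Availability can be checked and the fallback pool created from the command line. A sketch under illustrative names; note that `n1-highmem-64` happens to match the 64 vCPU / 416 GB shape quoted above, and that `machine-types list` shows what a zone *offers*, not its live spare capacity:

```shell
# List zones in the region that offer the N2D shape we normally use.
# (This shows offered types only; transient stockouts are not visible here.)
gcloud compute machine-types list \
  --filter="name=n2d-highmem-64 AND zone:us-central1-*"

# Fallback pool on N1: n1-highmem-64 provides 64 vCPUs / 416 GB RAM.
gcloud container node-pools create scoring-pool-n1 \
  --cluster=ml-pipeline-cluster \
  --zone=us-central1-a \
  --machine-type=n1-highmem-64 \
  --enable-autoscaling --min-nodes=0 --max-nodes=66
```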

Resource allocation based on machine type

We configured the pod requirements for each pipeline step by forcing the pods onto a specific machine type, instead of using generic resource requests/limits to obtain the needed resources.

This poses a problem when resources are limited. For example, suppose only 30 x 16-vCPU machines are available and two pipelines run concurrently: pipeline 1 needs all 30 x 16-vCPU machines and pipeline 2 needs 1 x 4-vCPU machine. If pipeline 1 acquires the 30 machines first, pipeline 2 must wait until pipeline 1 finishes and the node pool autoscales down before it can proceed.

The autoscale-down wastes time, and while scaling down, the machines may be taken by other projects' workloads.

For this particular event, we quickly switched the other concurrent pipelines to 16-vCPU machines instead of their predefined types to mitigate the impact of the limited resources.

Prevention and Follow-up:

We can prevent this class of incident in several ways:

  • multi-zonal cluster
  • resource label
  • pipeline sequence priority

Multi-zonal cluster

Upgrading from a zonal cluster to a multi-zonal cluster allows us to have node pools in different zones within the same region.

When limited resources are available in zone 1, we can still spawn and consume resources from other zones.
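One way to get this is a regional cluster, or a zonal cluster with extra node locations. A sketch with illustrative names; `--node-locations` spreads every node pool across the listed zones:

```shell
# Regional control plane with nodes in three zones of the same region.
# Names are illustrative, not the real cluster name.
gcloud container clusters create ml-pipeline-cluster \
  --region=us-central1 \
  --node-locations=us-central1-a,us-central1-b,us-central1-c
```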

Resource label

Instead of a hardcoded machine type that assigns pods only to designated machines, we remove this 1-to-1 labeling and use only resource requests/limits. This allows a pod to be scheduled on any machine that satisfies its requests/limits.
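In pod-spec terms, this means dropping the `nodeSelector` pin and declaring only `resources`. A minimal sketch; the pod name, image, and the per-pod figures (derived from ~15 pods per 64 vCPU / 416 GB node) are illustrative assumptions:

```shell
# Pod with plain requests/limits and no nodeSelector: the scheduler may
# place it on ANY node with spare capacity. All names are illustrative.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: batch-scoring-worker
spec:
  containers:
  - name: scorer
    image: gcr.io/example-project/scorer:latest
    resources:
      requests:
        cpu: "4"
        memory: 26Gi
      limits:
        cpu: "4"
        memory: 26Gi
EOF
```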

Pipeline sequence priority

To reduce costs, we still want to use preemptible nodes. To handle this safely, we need to reorder the pipeline steps and logic so that the heavy lifting happens at the beginning rather than at the end.

This gives us more time to handle failures when an incident hits the heavy-lifting part.
