Building a resilient machine learning pipeline

The field of machine learning engineering is still young. There's so much more to learn from other fields of engineering.

Besides trying to make our pipelines more accurate, faster, and cheaper, we should also pay attention to a fourth management objective: resilience.

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Since resilience is about being able to function, rather than being impervious to failure, there is no conflict between productivity and safety.

Fault tolerance, redundancy, unit tests, and end-to-end (e2e) tests are some of the terms we think of when it comes to building a resilient pipeline. However, resilience does not come from tests; it comes from design. With enough tests we can cover many different scenarios, but the pipeline can still be brittle and prone to bugs whenever it changes.
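As one illustration of resilience that lives in the design rather than in the tests, here is a minimal sketch of a prediction step with a redundant fallback. All names in it (`predict_with_fallback`, `primary_model`, `baseline_model`) are hypothetical and chosen only for this example; the assumption is that a simpler backup model and a safe default value exist for the pipeline to degrade to.

```python
import math
from typing import Callable, Sequence

Predictor = Callable[[Sequence[float]], float]


def predict_with_fallback(
    features: Sequence[float],
    predictors: Sequence[Predictor],
    default: float,
) -> tuple[float, str]:
    """Try each predictor in order and degrade to the next one on failure.

    Returns the prediction together with the name of the stage that
    produced it, so that degraded operation stays visible to operators.
    """
    for predictor in predictors:
        try:
            return predictor(features), predictor.__name__
        except Exception:
            # A failing stage is treated as an expected disturbance:
            # move on to the next, simpler option instead of crashing.
            continue
    # Every predictor failed: keep operating with a safe default.
    return default, "default"


def primary_model(features: Sequence[float]) -> float:
    # Hypothetical primary model that cannot handle missing values.
    if any(math.isnan(x) for x in features):
        raise ValueError("primary model cannot handle missing values")
    return sum(features) / len(features)


def baseline_model(features: Sequence[float]) -> float:
    # Hypothetical redundant model that tolerates missing values.
    valid = [x for x in features if not math.isnan(x)]
    if not valid:
        raise ValueError("no usable features")
    return max(valid)


if __name__ == "__main__":
    stages = [primary_model, baseline_model]
    print(predict_with_fallback([1.0, 3.0], stages, default=0.0))
    print(predict_with_fallback([float("nan"), 3.0], stages, default=0.0))
    print(predict_with_fallback([float("nan")], stages, default=0.0))
```

The point of the sketch is that the pipeline keeps functioning under an unexpected input, and it reports which stage answered, rather than relying on a test having anticipated that exact input.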

Many of the failures we observe can be attributed to human error. However, human error should not be treated as the root cause; it is where the investigation starts.

Root cause analysis can easily be misinterpreted and abused. Rather than focusing only on what went wrong, we can look at what goes right and mitigate issues by strengthening it.
