Posts

Showing posts from November, 2020

google cloud credential/service account practice

What is a Google service account?

A service account is a type of Google account intended to represent a non-human user. It is typically used to run workloads on virtual machines, and it handles authentication and authorization for access to different Google APIs.

Types of service account in Google Cloud Platform

There are a few types of service account.

User-managed service account

A user-managed service account is managed manually and is represented as a JSON file. This poses a security problem, because the account can be leaked accidentally through device compromise or human error.

{
  "type": "service_account",
  "project_id": "project-id",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account-email",
  "client_id": "client-id ...
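Since the excerpt's concern is a key file leaking by accident, here is a minimal sketch of a check that flags text that looks like a service account key, for example before it is committed to a repository. The function name `looks_like_service_account_key` is hypothetical and not part of any Google library. The check relies only on the `type` and `private_key` fields shown in the sample above.

```python
import json


def looks_like_service_account_key(text):
    """Return True if text parses as JSON shaped like a service account key.

    Hypothetical helper for illustration: it checks only the "type" and
    "private_key" fields that appear in the sample key file.
    """
    try:
        data = json.loads(text)
    except ValueError:
        # Not valid JSON, so it cannot be a JSON key file.
        return False
    if not isinstance(data, dict):
        return False
    return data.get("type") == "service_account" and "private_key" in data


# Placeholder values only; no real credentials.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "project-id",
    "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
})
flagged = looks_like_service_account_key(sample_key)
```

A check like this catches only the obvious case of an unmodified key file. It is a safety net, not a replacement for keeping keys off developer machines.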

machine learning pipeline batch prediction designs

Introduction

In a machine learning pipeline, we may face a scenario where we apply machine learning (ML) prediction to $customer * item$ combinations in an attempt to select the best item for each customer according to some criteria.

Flow 1

Given this scenario, we normally follow this flow:

1. obtain customer data
2. obtain item data
3. run ML prediction on each customer-item combination
4. rank the predictions per customer

This flow makes sense both logically (in business terms) and physically, in terms of how the code is implemented.

Flow 2

However, this is an oversimplification of the real world. We often have constraints on item or customer selection that limit which customer-item pairs we can use in prediction. These constraints can come from law, business contracts, or downstream requirements on how the selected customer-item combination should be used. This means that we have to apply selection criteria on customers and items such that the ML prediction only runs on the eligible p...
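As a rough illustration of Flow 1, the sketch below scores every customer-item combination and ranks the items per customer. Everything in it is an assumption made for the example: the `score` function is a hypothetical stand-in for a trained model, and the customer and item fields (`affinity`, `category`, `quality`) are invented toy data.

```python
from itertools import product


def score(customer, item):
    """Hypothetical stand-in for an ML model's prediction."""
    return customer["affinity"].get(item["category"], 0.0) * item["quality"]


def rank_items_per_customer(customers, items, score_fn=score):
    """Score every customer-item pair, then sort items per customer."""
    scores = {}
    # Step 3: run the prediction on each customer-item combination.
    for customer, item in product(customers, items):
        pair = (item["id"], score_fn(customer, item))
        scores.setdefault(customer["id"], []).append(pair)
    # Step 4: rank the predictions per customer, best first.
    return {
        customer_id: sorted(pairs, key=lambda p: p[1], reverse=True)
        for customer_id, pairs in scores.items()
    }


# Steps 1 and 2: obtain customer and item data (toy data here).
customers = [
    {"id": "c1", "affinity": {"books": 0.9, "games": 0.2}},
    {"id": "c2", "affinity": {"books": 0.1, "games": 0.8}},
]
items = [
    {"id": "i1", "category": "books", "quality": 1.0},
    {"id": "i2", "category": "games", "quality": 0.5},
]

ranking = rank_items_per_customer(customers, items)
best_item = {cid: pairs[0][0] for cid, pairs in ranking.items()}
```

The full cross product grows as the number of customers times the number of items, which is one reason to narrow the pairs before prediction when constraints apply.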

machine learning pipeline cloud resource incident

Context: We own a core business machine learning pipeline that runs weekly. The pipeline generates customer-campaign-content matching across 10+ million customers and 200+ campaigns. The pipeline also passes the final selected allocation to the downstream system, which sends out the campaigns.

Issue Summary: The pipeline has to finish within 1 day because of data availability timing and the requirements of the downstream system. The majority of the pipeline runs on GKE preemptible machines. The parallel batch scoring step in the pipeline uses ~1000 containers/pods, which requires ~66 machines with 64 CPUs and 416 GB of RAM each. In one of our production runs, we noticed that only 1 machine was allocated, which means the runtime would increase from 20 min to 30+ hours. This would not meet the deadline requirement.

Root Cause and Remediation: The root cause is a combination of:

- use of preemptible machines
- machine type availability
- resource allocation based on machine type

Preemptible machine

We use the preemptible machines t...