Create a kernel file for an inference service
What is a kernel file
A model kernel is a driver that loads a model and controls its behavior when the inference service is started, when an inference request is processed, and when the inference service is stopped. The model kernel works together with the model file and weight file, and it must be specified when creating a new inference service. Only published models can be used as an inference service, and these models must specify a kernel file and environment attributes, which are used during model initialization.
High level functions
Kernel files can be written in Python or C++. The kernel file must inherit from the kernel base class and include key functions that interface with the deep learning kernel.
on_kernel_start
: a function that is called during the inference service start phase; it typically loads the model for later inference.

on_task_invoke
: a function that is called when a real inference request comes in. After the model is started by on_kernel_start, it runs inference for one or more samples that the model can handle.

on_kernel_shutdown
: a function that is called when the inference service stops; it cleans up any temporary working files if needed.

Example:

    class MatchKernel(Kernel):
        def on_kernel_start(self, kernel_context): ...
        def on_task_invoke(self, task_context): ...
        def on_kernel_shutdown(self): ...
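The lifecycle above can be sketched with stand-in classes. This is only an illustration of the call order; the real Kernel base class comes from the redhareapi SDK, and the mock base class here is an assumption for running the sketch outside the service:

```python
class Kernel:
    """Stand-in for the redhareapi Kernel base class (assumed; SDK not required here)."""
    pass

class MatchKernel(Kernel):
    def __init__(self):
        self.events = []

    def on_kernel_start(self, kernel_context):
        # Load the model once, before any requests arrive.
        self.events.append("start")

    def on_task_invoke(self, task_context):
        # Run inference for each incoming request.
        self.events.append("invoke")

    def on_kernel_shutdown(self):
        # Clean up temporary files, handles, and so on.
        self.events.append("shutdown")

# The inference service driver calls the hooks in this order:
kernel = MatchKernel()
kernel.on_kernel_start(None)
kernel.on_task_invoke(None)
kernel.on_kernel_shutdown()
print(kernel.events)  # ['start', 'invoke', 'shutdown']
```

In the real service, on_task_invoke may be called many times between the single start and shutdown calls.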
Kernel file data structure
The kernel context and the task context pass parameters to a user-defined model kernel, each serving the particular purpose of its kernel function.
- Kernel context:
  Context information for model initialization, with important parameters such as the model weight file. It is defined during inference service creation and can be updated while the inference service is stopped.
- Task context:
  Context information that includes the actual user input data as well as a container for the inference output. The task context can include input or output data for multiple tasks, so it provides interfaces to iterate through those tasks. Each inference request can specify a different input context.
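The task-context behavior described above can be sketched with a mock object. The linked-list traversal via next() and the get_input_data/set_output_data interfaces follow the SDK usage shown later in this topic; the MockTaskContext class itself is a hypothetical stand-in for local experimentation:

```python
import json

class MockTaskContext:
    """Hypothetical stand-in for the SDK task context: tasks are chained
    and walked with next(), input/output are JSON strings."""
    def __init__(self, input_data, next_ctx=None):
        self._input = json.dumps(input_data)
        self._next = next_ctx
        self.output = None

    def get_input_data(self):
        return self._input

    def set_output_data(self, data):
        self.output = data

    def next(self):
        return self._next

# Two batched tasks chained together, walked the way a kernel would walk them.
t2 = MockTaskContext({"id": "req-2", "inputs": []})
t1 = MockTaskContext({"id": "req-1", "inputs": []}, next_ctx=t2)

ctx = t1
ids = []
while ctx is not None:
    ids.append(json.loads(ctx.get_input_data())["id"])
    ctx.set_output_data(json.dumps({"id": ids[-1], "outputs": []}))
    ctx = ctx.next()

print(ids)  # ['req-1', 'req-2']
```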
How to create a kernel for an inference service
- Create your kernel class. All model kernels must inherit from the base Kernel class defined in the elastic distributed inference SDK (redhareapi).

      import redhareapiversion
      from redhareapi import Kernel

      class MatchKernel(Kernel):
          ...

      if __name__ == "__main__":
          kernel = MatchKernel()
          kernel.run()
- Considerations for model start: When starting an inference service, the model kernel initializes the model with the weight file, attributes, and environments through the following interface:

      def on_kernel_start(self, kernel_context)

  The input kernel_context passes parameters that can be used to initialize the model:

      model_desc = json.loads(kernel_context.get_model_description())

  The primary work is to have the PyTorch resnet18 model load the model file:

      model_file = os.path.join(model_desc['model_path'], "model_epoch_final.pth")
      self.model = models.__dict__["resnet18"]()
      self.model.fc = nn.Linear(512, 10)
      self.model.load_state_dict(torch.load(model_file))
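The model-description handling in the start phase can be tried without PyTorch. The JSON payload below is a hypothetical example of what kernel_context.get_model_description() might return; the actual fields depend on how the inference service was created:

```python
import json
import os.path

# Hypothetical model description, shaped like the JSON string that
# kernel_context.get_model_description() returns (fields assumed).
model_desc_json = json.dumps({
    "name": "resnet18-cifar10",
    "model_path": "/models/resnet18-cifar10",
    "attributes": {"num_classes": 10},
})

model_desc = json.loads(model_desc_json)
# Resolve the weight file relative to the deployed model directory.
model_file = os.path.join(model_desc["model_path"], "model_epoch_final.pth")
print(model_file)
```

Keeping all paths relative to model_desc["model_path"] means the kernel does not need to know where the service deploys the published model.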
- Considerations for model invoke: Once the inference service is started, users can submit inference requests, which are handled as tasks in the model kernel through the following interface:

      def on_task_invoke(self, task_context)

  The intention of this interface is to do the actual inference job with the model loaded during the model start phase. A generic flow to achieve this purpose follows.
  - Loop through the task context to handle all the combined inference tasks:

        while task_context != None:
            ...
            task_context_vec.append(task_context)
            task_context = task_context.next()
  - For each task, retrieve the input data and any task-specific attributes, then send the prediction results back:

        input_data = json.loads(task_context.get_input_data())
        req_id = input_data['id']
        d = input_data["inputs"][0]
        #input_name = d["name"]
        data = d["data"]
        image_data = base64.b64decode(data)
        image_data = BytesIO(image_data)
        image_t = data_transforms(Image.open(image_data)).float()
        image_t = image_t.unsqueeze(0)
        outs = self.model(image_t)
        output_data = {"name": "output0", "datatype": "FP32", "shape": [1, 10],
                       "data": outs.data[0].numpy().tolist()}
        task_context.set_output_data(json.dumps({"id": req_id, "outputs": [output_data]}))
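The response-packaging step above can be exercised without PyTorch by feeding a plain list of scores. The build_output helper is hypothetical, and the "output0"/"FP32" names simply follow the example rather than a fixed SDK contract:

```python
import json

def build_output(req_id, scores):
    """Hypothetical helper: package raw class scores into the per-task
    response shape used in the invoke example."""
    output_data = {
        "name": "output0",        # output tensor name, as in the example
        "datatype": "FP32",       # element type of the scores
        "shape": [1, len(scores)],  # batch of one sample
        "data": list(scores),
    }
    return json.dumps({"id": req_id, "outputs": [output_data]})

# A fake 10-class score vector standing in for outs.data[0].numpy().tolist().
resp = build_output("req-1", [0.1] * 10)
parsed = json.loads(resp)
print(parsed["outputs"][0]["shape"])  # [1, 10]
```

Echoing the request id back in the output JSON lets the caller match responses to batched requests.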
- Considerations for model shutdown: If you want to do anything during the inference service stop phase, add it to the on_kernel_shutdown method. In this example, the kernel shutdown is recorded in the log file:

      def on_kernel_shutdown(self):
          Kernel.log_info('on_kernel_shutdown')
- Considerations for kernel logging: The following log interfaces are available. Logs are saved at <EGO_path>/dlim/logs:

      log_info(msg)
      log_warn(msg)
      log_debug(msg)
      log_error(msg)
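When developing a kernel outside the service, you might shim the four log interfaces onto Python's standard logging module so the same calls work locally. This mapping is an assumption for local testing only; inside the service, the SDK's own log functions write to the dlim log directory:

```python
import logging

records = []

class ListHandler(logging.Handler):
    """Capture log records in memory so a local run can inspect them."""
    def emit(self, record):
        records.append((record.levelname, record.getMessage()))

_logger = logging.getLogger("kernel")
_logger.setLevel(logging.DEBUG)
_logger.addHandler(ListHandler())

# Map the four SDK-style log interfaces onto standard logging levels.
log_info = _logger.info
log_warn = _logger.warning
log_debug = _logger.debug
log_error = _logger.error

log_info("model loaded")
log_error("bad input")
print(records)  # [('INFO', 'model loaded'), ('ERROR', 'bad input')]
```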