This guide describes online deployment using Hugging Face Inference Endpoints.

1. Online deployment

Use Inference Endpoints to deploy the model online when local resources are insufficient.

Deployment Steps

1. Access Deployment Interface

Visit Hugging Face Inference Endpoints to get started.
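
If you prefer to drive the deployment from a script rather than the web console, the huggingface_hub Python client exposes the same Inference Endpoints API. The snippet below is a minimal sketch (it assumes a recent huggingface_hub and a user token with Inference Endpoints permissions) that simply logs in and lists your existing endpoints to confirm the token works:

```python
# Minimal access check before deploying (assumes huggingface_hub is installed
# and that your token has Inference Endpoints permissions).
from huggingface_hub import login, list_inference_endpoints

login()  # paste your Hugging Face token when prompted

# List endpoints already attached to your account to verify access.
for endpoint in list_inference_endpoints():
    print(endpoint.name, endpoint.status)
```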

2. Configuration

Choose an instance. Here are our recommended choices for different model sizes:

| Model size | GPU  | Number of GPUs |
|------------|------|----------------|
| 2B         | L4   | 1              |
| 7B         | L40S | 1              |
| 72B        | L40S | 8              |

Taking the 7B model as an example, we choose the “Nvidia L40S 1GPU 48G” instance:

(Screenshot: instance selection)

Set Max Number of Tokens (per Query) to 32768

Set Max Batch Prefill Tokens to 32768

Set Max Input Length (per Query) to 32767, one less than the max total tokens so that at least one token can always be generated

(Screenshot: token limit settings)
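
As far as I can tell, these three UI fields correspond to the text-generation-inference settings MAX_TOTAL_TOKENS, MAX_BATCH_PREFILL_TOKENS, and MAX_INPUT_LENGTH; treat that mapping as an assumption and confirm it against your endpoint's environment settings. The sketch below records the values as environment variables so they can be reused in the endpoint-creation example at the end of this section:

```python
# Assumed mapping from the Inference Endpoints UI fields to
# text-generation-inference environment variables (names are an assumption;
# verify them in your endpoint's environment configuration).
TGI_ENV = {
    "MAX_TOTAL_TOKENS": "32768",          # Max Number of Tokens (per Query)
    "MAX_BATCH_PREFILL_TOKENS": "32768",  # Max Batch Prefill Tokens
    "MAX_INPUT_LENGTH": "32767",          # Max Input Length (per Query)
}
```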

Add the environment variable CUDA_GRAPHS=0 to avoid launch failures. Check this issue for details.
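
Putting it all together, the same configuration can also be applied programmatically with huggingface_hub's create_inference_endpoint. The sketch below is only an illustration under assumptions: the endpoint name and repository are placeholders, and the vendor/region, instance_type/instance_size identifiers, and environment-variable names should be checked against what the Inference Endpoints UI shows for your account.

```python
from huggingface_hub import create_inference_endpoint

# Hedged sketch: repository, vendor/region, and instance identifiers below are
# placeholders or assumptions -- confirm them in the Inference Endpoints UI.
endpoint = create_inference_endpoint(
    "my-7b-endpoint",                     # endpoint name (placeholder)
    repository="your-org/your-7b-model",  # placeholder model repo
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="nvidia-l40s",          # assumed identifier for the L40S instance
    instance_size="x1",                   # 1 GPU
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "MAX_TOTAL_TOKENS": "32768",
            "MAX_BATCH_PREFILL_TOKENS": "32768",
            "MAX_INPUT_LENGTH": "32767",
            "CUDA_GRAPHS": "0",  # avoid the launch failure mentioned above
        },
    },
)

endpoint.wait()      # block until the endpoint is running
print(endpoint.url)  # base URL for sending requests
```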