This guide describes online deployment with Hugging Face Inference Endpoints.
Use Inference Endpoints to deploy the model online when local resources are insufficient.
Visit Hugging Face Inference Endpoints to get started.
Choose an instance. Our recommended choices for different model sizes are listed below:
| Model size | GPU | Number of GPUs |
| --- | --- | --- |
| 2B | L4 | 1 |
| 7B | L40S | 1 |
| 72B | L40S | 8 |
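If you prefer to create the endpoint from code rather than the web UI, the `huggingface_hub` library provides `create_inference_endpoint`. Below is a minimal sketch for the 7B row of the table; the endpoint name, the model repository id, and the `instance_type`/`instance_size` slugs (`nvidia-l40s`, `x1`) are assumptions you should confirm against the instance list shown in the Inference Endpoints dashboard.

```python
# Minimal sketch: create an endpoint for the 7B model on a single L40S.
# The repository id and instance slugs below are placeholders; check the
# dashboard for the exact values available to your account.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-7b-endpoint",                     # endpoint name (hypothetical)
    repository="your-org/your-7b-model",  # placeholder model repository
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_type="nvidia-l40s",          # assumed slug for the L40S instance
    instance_size="x1",                   # 1 GPU, per the table above
)
endpoint.wait()      # block until the endpoint reports "running"
print(endpoint.url)  # base URL to send requests to
```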
Taking the 7B model as an example, we choose the “Nvidia L40S 1GPU 48G” instance and configure it as follows:
- Set Max Number of Tokens (per Query) to 32768.
- Set Max Batch Prefill Tokens to 32768.
- Set Max Input Length (per Query) to 32767.
- Add the environment variable `CUDA_GRAPHS=0` to avoid a launch failure. Check this issue for details.
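When deploying programmatically, the same limits can be supplied as container environment variables through the `custom_image` argument of `create_inference_endpoint` (see the sketch above). This is a sketch under the assumption that the dashboard fields map to the Text Generation Inference variables `MAX_TOTAL_TOKENS`, `MAX_BATCH_PREFILL_TOKENS`, and `MAX_INPUT_LENGTH`; the image tag and the `MODEL_ID` path are placeholders taken from typical TGI deployments, so verify them before use.

```python
# Sketch: the dashboard settings expressed as TGI container environment
# variables, passed via custom_image=... in create_inference_endpoint.
# The image tag is a placeholder; pin a specific TGI version in practice.
tgi_image = {
    "health_route": "/health",
    "url": "ghcr.io/huggingface/text-generation-inference:latest",
    "env": {
        "MAX_TOTAL_TOKENS": "32768",          # Max Number of Tokens (per Query)
        "MAX_BATCH_PREFILL_TOKENS": "32768",  # Max Batch Prefill Tokens
        "MAX_INPUT_LENGTH": "32767",          # Max Input Length (per Query)
        "CUDA_GRAPHS": "0",                   # avoid the launch failure noted above
        "MODEL_ID": "/repository",            # assumed path where the endpoint mounts the model
    },
}
# create_inference_endpoint(..., custom_image=tgi_image)
```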