This tutorial helps you quickly deploy the UI-TARS model and try out its capabilities through UI-TARS Desktop and Midscene.
When local resources are insufficient, use Inference Endpoints to deploy the model online.
Visit Hugging Face Inference Endpoints to get started.
Choose an instance. Our recommended instances for each model size are:
| Model size | GPU | Number of GPUs |
|---|---|---|
| 2B | L4 | 1 |
| 7B | L40S | 1 |
| 72B | L40S | 8 |
Taking the 7B model as an example, we choose the “Nvidia L40S 1GPU 48G” instance and apply the following settings:
- Set Max Number of Tokens (per Query) to 32768.
- Set Max Batch Prefill Tokens to 32768.
- Set Max Input Length (per Query) to 32767.
- Add the environment variable `CUDA_GRAPHS=0` to avoid launch failures. Check this issue for details.
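If you prefer scripting over the dashboard, the same deployment can be sketched with the `huggingface_hub` Python client. This is a minimal sketch under assumptions, not the official setup: the endpoint name, vendor, region, instance identifiers, and repository ID are placeholders to adjust for your account, and the environment variables assume the dashboard fields map onto the standard text-generation-inference (TGI) launcher options.

```python
from huggingface_hub import create_inference_endpoint

# Minimal sketch -- the name, vendor, region, instance identifiers, and
# repository ID below are placeholders; substitute your own values.
endpoint = create_inference_endpoint(
    "ui-tars-7b",                                    # hypothetical endpoint name
    repository="bytedance-research/UI-TARS-7B-DPO",  # assumed checkpoint; use the one you deploy
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                # assumed cloud vendor
    region="us-east-1",          # assumed region
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l40s", # the recommended L40S instance for the 7B model
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            # Assumed mapping from the dashboard fields to TGI launcher options:
            "MODEL_ID": "/repository",
            "MAX_TOTAL_TOKENS": "32768",          # Max Number of Tokens (per Query)
            "MAX_BATCH_PREFILL_TOKENS": "32768",  # Max Batch Prefill Tokens
            "MAX_INPUT_LENGTH": "32767",          # Max Input Length (per Query)
            "CUDA_GRAPHS": "0",                   # avoids the launch failure noted above
        },
    },
)
```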
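Once the endpoint reports Running, a quick smoke test confirms it responds before you wire the URL into UI-TARS Desktop or Midscene (a sketch reusing the `endpoint` object from above):

```python
# Block until the endpoint is up, then send a trivial chat request.
endpoint.wait()
response = endpoint.client.chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(response.choices[0].message.content)

# endpoint.url is the base URL to paste into the UI-TARS Desktop / Midscene settings.
print(endpoint.url)
```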