A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s.
Read in full here:
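The idea behind token-level scheduling, roughly, is that instead of dedicating a GPU to one model, the scheduler interleaves single-token decode steps from requests belonging to different models on the same device. A minimal sketch (all names and the round-robin policy here are illustrative assumptions, not the paper's actual design):

```python
from collections import deque

def fake_decode_step(model, request):
    # Stand-in for one forward pass that emits one token for this request.
    request["tokens"].append(f"{model}:t{len(request['tokens'])}")
    request["remaining"] -= 1
    return request["remaining"] > 0  # True if more tokens are still needed

def token_level_schedule(queues):
    """Round-robin one decode step per model queue until all requests drain."""
    finished = []
    while any(queues.values()):
        for model, q in queues.items():
            if not q:
                continue
            req = q.popleft()
            if fake_decode_step(model, req):
                q.append(req)        # request still generating: requeue it
            else:
                finished.append(req) # request done: free its slot
    return finished

# Two hypothetical models sharing one GPU's decode loop.
queues = {
    "llm_a": deque([{"tokens": [], "remaining": 2}]),
    "llm_b": deque([{"tokens": [], "remaining": 3}]),
}
done = token_level_schedule(queues)
print([len(r["tokens"]) for r in done])  # → [2, 3]
```

Because scheduling happens per token rather than per request, a short request on one model never waits behind a long generation on another, which is what lets utilization stay high enough to consolidate many models onto far fewer GPUs.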
They probably want to become less dependent on NVIDIA GPUs.
Some companies in China are probably already building their own GPUs to rival NVIDIA's.