Description
Liangsheng Yin, Ying Sheng, and Lianmin Zheng (lmsys) present an optimization for constrained decoding of JSON or YAML in local large language models (LLMs). The method uses a compressed finite state machine that can be built from any regular expression, so it can accommodate any JSON or YAML schema. By analyzing the finite state machine of a regular expression and compressing singular transition paths (runs of states that have only one valid next transition), the approach decodes multiple tokens in a single step whenever feasible, which significantly accelerates decoding. As a result, constrained decoding can run even faster than normal (unconstrained) decoding. The authors compare their method with existing systems such as guidance + llama.cpp and outlines + vLLM, and report up to a 2x reduction in latency and a 2.5x boost in throughput.
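The idea of compressing singular transition paths can be illustrated with a minimal sketch. This is not the authors' implementation: it works on characters rather than tokens, and the names build_chain_dfa and jump_forward are hypothetical helpers introduced only for this example. It hand-builds a small state machine for the pattern {"name": "[a-z]+"} and shows that the fixed prefix can be emitted in one jump, because every state along it has exactly one outgoing transition.

```python
import string


def build_chain_dfa(prefix, charset, suffix):
    """Hand-build a character-level DFA for: prefix, then one or more
    characters from charset, then suffix. Hypothetical helper; assumes
    suffix[0] is not in charset."""
    transitions = {}
    state = 0
    # Fixed prefix: a chain of states, each with a single outgoing edge.
    for ch in prefix:
        transitions[state] = {ch: state + 1}
        state += 1
    # First free character: many choices, so the model must decide here.
    loop = state + 1
    transitions[state] = {c: loop for c in charset}
    # Loop state: another free character, or the start of the suffix.
    transitions[loop] = {c: loop for c in charset}
    transitions[loop][suffix[0]] = loop + 1
    state = loop + 1
    # Rest of the fixed suffix: again a single-edge chain.
    for ch in suffix[1:]:
        transitions[state] = {ch: state + 1}
        state += 1
    transitions[state] = {}  # accepting state, no outgoing edges
    return transitions, state


def jump_forward(transitions, state):
    """Follow edges while the current state has exactly one successor.
    Returns the forced text and the state where a real choice resumes."""
    forced = []
    seen = {state}  # guard against single-edge cycles
    while len(transitions[state]) == 1:
        ch, nxt = next(iter(transitions[state].items()))
        if nxt in seen:
            break
        forced.append(ch)
        seen.add(nxt)
        state = nxt
    return "".join(forced), state


transitions, accept = build_chain_dfa('{"name": "', string.ascii_lowercase, '"}')
forced_text, resume_state = jump_forward(transitions, 0)
print(repr(forced_text), resume_state)
```

In this sketch the forced text returned from the start state is the whole fixed prefix, so a decoder could append it in one step and only consult the model again at the state where several characters are valid. A real system would also have to map the forced text back onto the model's tokens, which this example leaves out.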