SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
2023-07-05
Artificial Intelligence, Information Processing | Computing, Research
Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah and Subhabrata Mukherjee propose SkipDecode, a simple and effective token-level early-exit method designed to work seamlessly with batch inference and key-value (KV) caching. It overcomes the constraints of prior early-exit methods by setting a single exit point for every token in...
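To make the batching and caching idea concrete, here is a minimal sketch of a per-position layer schedule. It rests on assumptions that go beyond the summary above: (1) the number of layers a token runs depends only on its sequence position, so every token in the batch at that position shares one exit point; (2) that number follows a hypothetical linear decay from `max_layers` down to `min_layers`; and (3) a token skips its lowest layers and runs the top `k`. The function names `layers_per_position`, `active_layers` and `kv_cache_complete` are illustrative only and are not the authors' implementation.

```python
from typing import List


def layers_per_position(seq_len: int, min_layers: int, max_layers: int) -> List[int]:
    """Hypothetical linear-decay schedule: how many layers each position runs.

    There is one value per sequence position, shared by every token in the batch
    at that position, so the whole batch executes the same layers in lockstep.
    Integer arithmetic keeps the schedule non-increasing.
    """
    if seq_len == 1:
        return [max_layers]
    span = max_layers - min_layers
    # Position 0 runs max_layers; the last position runs exactly min_layers.
    return [max_layers - (t * span) // (seq_len - 1) for t in range(seq_len)]


def active_layers(num_layers: int, k: int) -> range:
    """Assumed skip pattern: bypass the lowest layers and run the top k."""
    return range(num_layers - k, num_layers)


def kv_cache_complete(num_layers: int, schedule: List[int]) -> bool:
    """Return True if every layer a token runs already holds cached keys/values
    from all earlier positions, i.e. no earlier KV entries must be recomputed."""
    for t, k in enumerate(schedule):
        for layer in active_layers(num_layers, k):
            for s in range(t):
                if layer not in active_layers(num_layers, schedule[s]):
                    return False
    return True


if __name__ == "__main__":
    schedule = layers_per_position(seq_len=8, min_layers=4, max_layers=12)
    print(schedule)                          # [12, 11, 10, 9, 8, 7, 6, 4]
    print(kv_cache_complete(24, schedule))   # True
    # A schedule that grows over positions would leave gaps in the cache.
    print(kv_cache_complete(24, [4, 8]))     # False
```

Under these assumptions, a non-increasing schedule means each later token runs a subset of the layers every earlier token ran, so its attention at any active layer can reuse the cached entries. The `[4, 8]` case shows the failure mode: the second token would need layers 16 to 19, which the first token never computed.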