- A new quantization algorithm, HAWQ-V3, that performs inference using only integer multiplication, addition, and bit shifting, with no floating-point operations or integer division
- Achieves higher accuracy than prior work, including up to 5% higher than Google's integer-only method, with no accuracy degradation for INT8 quantization
- Proposes a novel integer linear programming (ILP) formulation to find an optimal per-layer mixed-precision (INT4/INT8) assignment that balances model size, latency, and accuracy
- An implementation in TVM demonstrates up to 1.5x speedup for INT4 quantization over INT8 on NVIDIA T4 GPU Tensor Cores
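
The first bullet's "no floating point, no integer division" constraint is typically met by replacing each real-valued requantization scale with a dyadic approximation, i.e. an integer multiplier followed by a bit shift. A minimal sketch of that idea (function names `dyadic_approx` and `requantize` are illustrative, not from the paper):

```python
def dyadic_approx(scale: float, shift_bits: int = 20):
    """Approximate a real-valued scale as m / 2**shift_bits.

    Returns (m, shift_bits) so that scale ~= m >> shift_bits can be
    applied with integer arithmetic only.
    """
    m = round(scale * (1 << shift_bits))
    return m, shift_bits

def requantize(acc: int, m: int, c: int) -> int:
    """Rescale an integer accumulator: multiply by m, then arithmetic
    right shift by c. No floating point, no integer division."""
    return (acc * m) >> c

# Example: rescaling an INT32 accumulator by scale 0.0123
m, c = dyadic_approx(0.0123)
print(requantize(1000, m, c))  # close to 1000 * 0.0123 = 12.3
```

Note that the right shift floors rather than rounds to nearest; real kernels often add a rounding offset before shifting, omitted here for brevity.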
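
The ILP in the third bullet assigns each layer a bitwidth so as to minimize an accuracy-loss proxy (e.g. a sensitivity score) subject to size or latency budgets. The toy sketch below uses brute-force search in place of an ILP solver, with made-up per-layer statistics, purely to illustrate the objective and constraint structure:

```python
from itertools import product

# Hypothetical per-layer stats: sensitivity (accuracy-loss proxy if the
# layer is quantized to INT4) and latency at each precision. Values are
# invented for illustration.
layers = [
    {"sens": 0.8, "lat4": 1.0, "lat8": 1.6},
    {"sens": 0.1, "lat4": 2.0, "lat8": 3.1},
    {"sens": 0.3, "lat4": 1.5, "lat8": 2.4},
]

def best_assignment(layers, latency_budget):
    """Minimize total sensitivity subject to a latency budget by trying
    every INT4/INT8 assignment; a brute-force stand-in for the ILP."""
    best_sens, best_bits = float("inf"), None
    for bits in product((4, 8), repeat=len(layers)):
        lat = sum(l["lat4"] if b == 4 else l["lat8"]
                  for l, b in zip(layers, bits))
        sens = sum(l["sens"] for l, b in zip(layers, bits) if b == 4)
        if lat <= latency_budget and sens < best_sens:
            best_sens, best_bits = sens, bits
    return best_bits

print(best_assignment(layers, latency_budget=7.0))  # per-layer bitwidths
```

With a real model the search space is 2^L, which is why the paper formulates the problem as an ILP and hands it to a solver instead of enumerating assignments.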