The document proposes a low-cost and high-performance Montgomery modular multiplication algorithm and VLSI architecture. It uses a single-level carry-save adder to perform additions, reducing carry propagation and improving speed but requiring extra clock cycles. To address this, it introduces a configurable carry-save adder that can reduce the extra clock cycles by half. It also detects unnecessary additions to further improve throughput while maintaining short critical path delays. Experimental results show improved performance and area-time efficiency over previous designs.