The document proposes a Montgomery modular multiplication algorithm and hardware architecture that use carry-save adders (CSAs) to multiply binary operands at low cost and high performance. A single-level CSA avoids carry propagation during each addition, and the same adder also performs operand precomputation and the conversion between carry-save and binary formats. To reduce the extra clock cycles this incurs, the design introduces a configurable CSA that can operate either as one full adder or as two half adders. It also adds a mechanism that detects and skips unnecessary carry-save additions, which lowers the clock cycle count further while keeping the critical path short. Experimental results show that the proposed design achieves higher performance and better area-time efficiency than previous Montgomery modular multipliers.
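
To make the carry-save idea concrete, here is a minimal software sketch of radix-2 Montgomery multiplication in which the running result is kept as a (sum, carry) pair and each iteration uses a single 3-to-2 carry-save addition. It is an illustration under stated assumptions, not the architecture from the document: the function names (`csa`, `to_binary`, `mont_mul_csa`), the choice of `k` as the bit length of the modulus, and the use of repeated half-adder passes for format conversion are all hypothetical choices made for this example. The skip mechanism and the configurable adder are not modeled.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                             # bitwise sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1    # carries, moved one position left
    return s, c


def to_binary(ss, sc):
    """Convert a carry-save pair to a binary integer using half-adder passes."""
    while sc:
        ss, sc = ss ^ sc, (ss & sc) << 1
    return ss


def mont_mul_csa(a, b, n):
    """Return a * b * 2^(-k) mod n, with k = n.bit_length().

    Assumes n is odd and 0 <= a, b < n.
    """
    k = n.bit_length()
    # Operand precomputation: d = b + n, formed with the same adder primitives.
    d = to_binary(*csa(b, n, 0))
    ss, sc = 0, 0                             # running result in carry-save form
    for i in range(k):
        a_i = (a >> i) & 1
        # q makes (ss + sc + a_i*b + q*n) even; only the low bits are needed.
        q = (ss ^ sc ^ (a_i & b)) & 1
        # Select one operand so that a single CSA level suffices.
        x = (0, n, b, d)[2 * a_i + q]
        ss, sc = csa(ss, sc, x)
        # The total is even and the carry word has a zero low bit,
        # so both words can be shifted right without losing information.
        ss >>= 1
        sc >>= 1
    s = to_binary(ss, sc)                     # format conversion back to binary
    return s - n if s >= n else s             # final result lies in [0, n)
```

The result carries the usual Montgomery factor, so for `k = n.bit_length()` it satisfies `mont_mul_csa(a, b, n) * 2**k % n == a * b % n`. Precomputing `b + n` is what lets each iteration add only one selected operand to the (sum, carry) pair instead of two.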