One of the things that could make a huge difference in the speed of an algorithm is the choice of representation. Since many squarings are necessary, the cost of converting between binary and another representation only needs to be incurred once, so if there is even a slight speed-up due to a more efficient representation it is going to make a huge difference.
This thread is to collect ideas about different representations that participants might want to try implementing. I am going to start by suggesting a few that have popped up:
Montgomery multiplication 
Chinese remainder theorem based representation
This one is very exciting as squaring a number in the CRT representation is very cheap (as it can be done on the coefficients). In fact, the asymptotically fastest parallel algorithm for powering (computing
x is an
n-bit number) is based on this representation . The tricky part is doing modular reduction, one idea is that it can be done without converting back to binary but if it is practical remains to be explored .
What makes this representation potentially very powerful is the ability to do a number of squarings before reduction.
- Number Theoretic transform
Similar to CRT, Number theoretic transform (based on the Fourier transform, but using roots of unity in finite fields rather than the complex numbers) allows local squarings. I don’t know if reduction is possible in this representation is possible or if conversion back to binary is necessary, but even then it may still be promising as again several squarings would be possible.