This page has three sections: Considerations for Maximizing GPU Performance, System Size Limits, and Accuracy Considerations.
Considerations for Maximizing GPU Performance
There are a number of considerations, above and beyond those that apply to CPU runs, for
maximizing the performance of a GPU-accelerated PMEMD simulation. The following tips help
ensure good performance.
- Avoid using small values of NTPR, NTWX, NTWV, NTWE and NTWR. Writing to
the output, restart and trajectory files too often can hurt performance even on CPU runs,
and this is more acute for GPU accelerated simulations because there is a substantial cost
in copying data to and from the GPU. Performance is maximized when CPU to GPU memory
synchronizations are minimized. This is achieved by computing as much as possible on the
GPU and only copying back to CPU memory when absolutely necessary. There is additional
overhead because performance is boosted by calculating energies only when they are actually
needed, so setting NTPR or NTWE to low values results in excessive energy calculations. You
should not set any of these values below 100 (except 0 to disable them) and ideally use
values of 500 or more. For a million-atom system, just writing the restart file every 1000
steps can account for 10% of the run time with a modern GPU (the GPU is faster, but the disk
writing is not). Ideally, set NTWR to 100,000 or more, or just let it default to nstlim.
(See the example input after this list.)
- Avoid setting ntave != 0. Turning on the printing of running averages means the code
must calculate both energies and forces on every step. This can lead to a performance loss
of 8% or more when running on the GPU. It can also affect performance on CPU runs, although
the difference is not as marked. Similar arguments apply to setting ene_avg_sampling to
small values.
- Avoid using the NPT ensemble (NTB=2) when it is not required; if constant pressure is
needed, use the Monte Carlo barostat (barostat=2). Performance will generally be
NVE > NVT > NPT (NVT ~ NPT with barostat=2, although on modern GPUs the CPU-side work of
the barostat can become a bottleneck and impose a larger performance hit).
- Set netfrc = 0 in the &ewald namelist if feasible. This feature is new in Amber18 and is
not something that will help energy conservation in NVE simulations; if anything,
netfrc = 1 will unmask upward energy drift from other sources. It will help the thermostat
maintain the exact specified temperature, but if that is not critical then turn it off and
save a few percent of the wall time in your calculation (see the example input after this
list).
- Set skin_permit = 0.75, also in the &ewald namelist. This new feature generalizes
something that has always been in Amber. The default setting, 0.5, reproduces the original
behavior: the pair list is rebuilt as soon as any one particle has traveled up to half the
margin of safety (the "skin") from the position it occupied when the pair list was last
built. If one particle has traveled half that distance, another may have as well, so it is
possible that the pair list is not counting every interaction within the specified cutoff.
Notice the wording: even under the default setting, and historically, we cannot absolutely
guarantee that the pair list has never been violated, because we only rebuild the pair list
after the possibility of a violation has been detected. Violations are extremely unlikely:
perhaps one interaction in a trillion has been missed. With the new skin_permit input,
however, we can ask that the pair list be rebuilt only when a particle has traveled 75%
(0.75, recommended) or even 100% of the margin of safety. Even if particles are allowed to
travel that far, it takes two particles traveling towards one another to form an interaction
that the existing pair list does not know about. Our tests indicate that
skin_permit = 0.75 will allow about one in fifty million interactions at the periphery of
the cutoff sphere to be missed, and skin_permit = 1.00 about one in a million. While we can
detect a marginal amount of heating when skin_permit is turned up to its maximum value of
1.0, the lower value of 0.75 confers most of the benefit (fewer occasions to rebuild the
pair list, and a speedup of as much as 10%) for negligible risk. As with any new feature,
use it with caution, but for very large systems (more chances for one particle to trigger a
rebuild) and longer time steps we were seeing pair list rebuilds every other step, which we
are confident is unnecessary.
- Avoid the use of GBSA in implicit solvent GB simulations unless required.
The GBSA term is calculated on the CPU and thus requires a synchronization between GPU and
CPU memory on every MD step as opposed to every NTPR or NTWX steps when running without
this option.
- Do not assume that the GPU will always be faster for small systems. For GB simulations
of fewer than about 150 atoms and PME simulations of fewer than about 4,000 atoms, it is
not uncommon for the CPU version of the code to outperform the GPU version on a single
node. The performance differential between GPU and CPU runs typically increases with atom
count.
- For some small systems, the GPU program will refuse to start unless
supplied with an additional command line argument "-AllowSmallBox". This is
deliberate: there is a known vulnerability in the pair list that can cause systems which
are less than three times the extended cutoff in any direction to miss non-bonded
interactions. For this reason, such systems are flagged. The CPU code will likely run them
faster in any case, given their low atom counts.
- Turn off ECC (Tesla models C2050 and later). ECC can cost you up to
10% in performance. You should verify that your GPUs are not giving ECC errors
before attempting this. You can turn this off on Tesla C2050 based cards and later by
running the following command for each GPU ID as root, followed by a reboot:
nvidia-smi -g 0 --ecc-config=0 (repeat with -g x for each GPU ID)
- Extensive testing of AMBER on a wide range of hardware has established that
ECC has little to no benefit for the reliability of AMBER simulations. This is part of the
reason it is acceptable to use GeForce gaming cards for AMBER simulations (see recommended
hardware) and why we recommend them. For more details on ECC and MD simulations see the
following paper: Betz, R.M., DeBardeleben, N.A., Walker, R.C., "An investigation of the
effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics
simulations", Concurrency and Computation: Practice and Experience, 2014, DOI:
10.1002/cpe.3232.
- If you see poor performance when running more than one multi-GPU job, make sure you turn
off thread affinity within your MPI implementation, or at least pin each MPI task to a
different core. For example, if you run two dual-GPU jobs and they do not both run at the
speed of either job given free rein over the machine, you may need to make this change.
MPICH may not enable affinity by default, in which case no special settings are needed.
However, both MVAPICH and OpenMPI set thread affinity by default. This would actually be
useful if they did it in an intelligent way; however, they appear to pay no attention to
load, or even to other MVAPICH or OpenMPI runs, and always just assign from core 0. So two
2-GPU jobs are, rather foolishly, both assigned to cores 0 and 1. The simplest solution is
to disable thread affinity as follows:
MVAPICH: export MV2_ENABLE_AFFINITY=0; mpirun -np 2 ...
OpenMPI: mpirun --bind-to none -np 2 ...
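To make the settings above concrete, the sketch below shows a minimal production-style mdin
that collects the recommended choices (infrequent output, ntave = 0, the Monte Carlo
barostat when NPT is required, and the &ewald options netfrc and skin_permit). The specific
values of nstlim, dt, cut, temp0 and gamma_ln are placeholders for illustration only; adjust
them for your system and check that the &ewald options are supported by your Amber version.
 &cntrl
   imin=0, irest=1, ntx=5,              ! restart a production MD run
   nstlim=500000, dt=0.002,             ! placeholder run length and time step
   ntc=2, ntf=2, cut=8.0,
   ntt=3, gamma_ln=2.0, temp0=300.0, ig=-1,
   ntb=2, ntp=1, barostat=2,            ! Monte Carlo barostat if NPT is required
   ntpr=500, ntwx=500, ntwe=0,          ! write energies/coordinates infrequently
   ntwr=100000, ntave=0,                ! infrequent restarts, no running averages
 /
 &ewald
   netfrc=0,                            ! skip the net-force correction if not needed
   skin_permit=0.75,                    ! rebuild the pair list less often
 /
Only the output frequencies (ntpr, ntwx, ntwe, ntwr), ntave, barostat and the &ewald options
are the subject of the list above; the remaining lines are ordinary production-MD input
included just to make the sketch complete.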
System Size Limits
In order to obtain the large speedups that we see with GPUs, it is critical that the entire
calculation take place on the GPU, within GPU memory. This avoids the performance hit from
copying data to and from the GPU and allows us to achieve substantial speedups for systems
of practical size. The consequence is that the entire calculation must fit within GPU
memory. We also make use of a number of scratch arrays to achieve high performance, so GPU
memory usage can be higher than that of a typical CPU run. It also means, due to the way
parallel GPU support was initially implemented, that memory usage per GPU does NOT decrease
as you increase the number of GPUs. This is something we hope to fix in the future, but for
the moment the atom count limit imposed by GPU memory is roughly constant whether you run in
serial or in parallel.
Since, unlike a CPU, it is not possible to add more memory to a GPU (without replacing it
entirely), and since there is no concept of swap as there is on the CPU, the size of the GPU
memory imposes a hard limit on the number of atoms supported in a simulation. Early in the
mdout file you will find information on the GPU being used and an estimate of the amount of
GPU and CPU memory required:
|------------------- GPU DEVICE INFO --------------------
|
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
| CUDA Device Name: Tesla C2070
| CUDA Device Global Mem Size: 6143 MB
| CUDA Device Num Multiprocessors: 14
| CUDA Device Core Freq: 1.15 GHz
|
|--------------------------------------------------------
...
| GPU memory information:
| KB of GPU memory in use: 4638979
| KB of CPU memory in use: 790531
The reported GPU memory usage is likely an underestimate and is meant only as guidance, to
give you an idea of how close you are to the GPU's memory limit. Just because it is less
than the available Device Global Mem Size does not necessarily mean the simulation will run.
You should also be aware that the GPU's available memory is reduced by 1/9th if you have ECC
turned on.
Memory usage is affected by the run parameters, in particular the size of the cutoff (larger
cutoffs need more memory) and the ensemble being used. Additionally, the physical GPU
hardware affects memory usage, since the optimizations used differ between GPU types.
Typically, for PME runs, memory usage follows:
- NPT > NVT > NVE
- NTT = 3 > NTT = 1 or NTT = 2 > NTT = 0
- Barostat = 1 > Barostat = 2
Other than size, the density of your system and the non-bonded cutoff are perhaps the most
important factors in memory usage. They drive the amount of space allocated for pair list
building, and can be thought of as two sides of the same coin. A very dense system with a
short cutoff packs as many particles within the cutoff sphere as a lower density system with
a long cutoff, and a dense system with a long cutoff is double trouble. The following table
provides an approximate UPPER BOUND to the number of atoms supported as a function of GPU
model. These numbers were estimated using boxes of TIP3P water (PME) and solvent caps of
TIP3P water (GB). These had lower than optimum densities, so the actual limits for dense
solvated proteins may be around 20% less than the numbers here. Nevertheless, these
should provide reasonable estimates to work from.
GPU Type                                                       | Memory  | Max Atom Count
GTX580                                                         | 3.0 GB  | 1,240,000
M2090 (ECC off)                                                | 6.0 GB  | 2,600,000
GTX680                                                         | 2.0 GB  | 920,000
K10 (ECC off) / GTX780 / GTX980                                | 4.0 GB  | 1,810,000
K20X (ECC off) / GTX-Titan / GTX-Titan-Black                   | 6.0 GB  | 2,600,000
K40 / K80 / M40 (ECC off) / GTX-Titan-X / GTX980TI / Titan-XP  | 12.0 GB | 3,520,000
GTX-1080                                                       | 8.0 GB  | 2,800,000
Statistics are only provided for periodic systems. In theory, a GB implicit solvent
simulation system has memory requirements that could exceed the card, but the systems would
have to be so large (hundreds of thousands to millions of atoms) that simulating them with
GB solvent (effort growing with the square of the system size) would be impractical for other
reasons.
Accuracy Considerations
On the current generation of GPUs, single precision arithmetic has substantially higher
throughput than double precision. This matters when trying to obtain good performance. The CPU
code in Amber has always used double precision throughout the calculation. While this full
double precision approach has been implemented in the GPU code (read on), it gives very poor
performance and so the default precision model used when running on GPUs is a combination of
single and fixed precision, termed hybrid precision (SPFP), that is discussed in further
detail in the following reference:
Scott Le Grand; Andreas W. Goetz; & Ross C. Walker* "SPFP: Speed without compromise - a
mixed precision model for GPU accelerated molecular dynamics simulations.", Comp. Phys.
Comm, 2013, 184, pp374-380, DOI: 10.1016/j.cpc.2012.09.022.
This approach uses single precision for individual calculations within the simulation but
fixed scaled integer precision for all accumulations. It also uses fixed precision for
SHAKE calculations and for other parts of the code where loss of precision was deemed to be
unacceptable. Tests have shown that energy conservation is equivalent to the full double
precision code and specific ensemble properties, such as order parameters, match the full
double precision CPU code. Previous acceleration approaches, such as the MDGRAPE-accelerated sander, have used similar hybrid precision models and thus we believe that this is a
reasonable compromise between accuracy and performance. The user should understand that this
approach leads to rapid divergence between GPU and CPU simulations. The CPU code's
simulations will diverge from one another if run in parallel, because they use floating
point accumulation (albeit 64-bit) with no control over the order in which different threads
add to each sum. The GPU code, in contrast, is tremendously parallel, but the integer
fixed-precision accumulation ensures that its simulations are self-consistent. The
divergence is cosmetic in the end: any statistical mechanical properties should converge to
the same values.
Another precision mode, Double-Precision calculation / Fixed Precision accumulation
(DPFP), is also built at compile time. This mode approximates the CPU code, with the
added benefit of reproducible results thanks to the integer accumulation. It would be the
model of choice if it were not three to four times slower on the expensive Tesla cards
(e.g., GP100) and ten times slower on GeForce cards (e.g., GTX-1080Ti). It primarily exists for
testing and debugging purposes.
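The precision model is selected by which executable you run rather than by an mdin flag.
Assuming a standard Amber installation that provides pmemd.cuda_DPFP alongside the default
pmemd.cuda (the SPFP build), a validation comparison might look like the following, with
placeholder file names:
SPFP (default): pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md_spfp.out -r md_spfp.rst
DPFP (testing): pmemd.cuda_DPFP -O -i md.in -p prmtop -c inpcrd -o md_dpfp.out -r md_dpfp.rst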
We recommend using the CPU code for energy minimization
One limitation of either GPU precision model is that forces can be truncated or, in the
worst case, overflow the fixed precision representation. This should never be a problem
during MD simulations for any well behaved system. However, for minimization or very early
in the heating phase it can present a problem. This is especially true if two atoms are
close to each other and thus have large VDW repulsions. We recommend using the CPU version
of the code for this short process. Only in situations where the structure is guaranteed to
be reasonable, for example if it was a snapshot from dynamics, should you use the GPU code
for minimization.
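In practice this means a simple two-stage workflow: minimize (and, if the starting structure
is poor, begin heating) with the CPU code, then restart on the GPU for production. A sketch,
with placeholder file names:
Minimization (CPU): pmemd -O -i min.in -p prmtop -c inpcrd -o min.out -r min.rst
Production MD (GPU): pmemd.cuda -O -i md.in -p prmtop -c min.rst -o md.out -r md.rst -x md.nc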