Debugging Raid Lag in 25s

#1 - Aug. 19, 2013, 8:12 p.m.
Blizzard Post
TL;DR version: Healing is strongly implicated in causing significant "input lag" in 25 man raids. Temerity tested a number of theories this week, and the data supports this conclusion and discredits DPSing, addons, RPPM, and even Stampede as causes. The lag also seems to correlate with peak US raid times (presumably when many raids and LFRs are running).

Background:

As many 25 man raiders know, particularly those raiding Heroic content, there are times when the game client becomes very unresponsive. Abilities take far longer than they should to register as used, causing very frustrating gameplay for most classes and specs (particularly those who are riding each GCD rather than having longer spell casts). This happens primarily on the pull of many fights, but it occurs at other times as well.

It does not occur in 10 man raids. It occurs less in 25 Normal raids.

Here is a video emphasizing the issue. The perspective is of a Brewmaster Monk and a Beast Master Hunter. The video is slowed down to one-third normal speed, and demonstrates some of the worst lag we've experienced lately. It also shows what a lag-free experience is like. Note the delays between the game client registering a button press (gold outline) and the game acting on it (resource consumption, cooldown trigger). It can be upwards of 1 second, and this is the source of our pain.

http://www.youtube.com/watch?v=Ff56_vfZ5sw

The Experiment:

We had many theories (RPPM being most people's prime culprit, followed by Stampede and addons), but we constructed an experiment to definitively prove what the cause is. We spent nearly two hours testing this in ToT this week, primarily on the training dummies before Jin'rokh, trying many variations:

1) No addons
2) No RPPM trinkets, set bonuses, meta, or enchants
3) No pets
4) No healing
5) Hunters (Stampede)

We did multiple pulls of each theory and privately recorded our own ratings of each pull, on a scale of 1-10 (1 being no lag, 10 being unplayable). Our goal was to see if patterns emerged without any communication tainting the results. After several experiments, we tallied our results and it was clear:

Healing causes the input lag.

Once we had this theory, we tested further. We tried different healers. We tried different numbers of healing cooldowns. We tried more or fewer temporary pets (such as Stampede). We tried pulls with healers doing nothing for the first minute, then going nuts. We even did a pull where only two players DPSed and healers went nuts. The clear result was that healing of any kind caused the lag, made worse by the number of healers redlining it. It had nothing to do with what DPS did, and nothing to do with the addons people were running.

Other guilds have seen the same thing; in fact, every Heroic raiding guild probably sees it, perhaps depending slightly upon their healing composition. Cooldowns and smart heals seem to exacerbate the issue.

Our spreadsheet of results:

https://docs.google.com/spreadsheet/ccc?key=0ArK4-MdVBzk5dHAyTkVNOXJpRnhmWG43MFhubjlWZXc
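
For anyone who wants to tally their own blind ratings the same way, here is a minimal sketch in Python. It assumes each raider privately records one rating per pull on the 1-10 scale described above. The condition names, player names, and numbers below are made up for illustration and are not the data from the spreadsheet.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (condition, player, rating), 1 = no lag, 10 = unplayable.
# These values are invented for illustration; they are NOT the guild's data.
ratings = [
    ("baseline", "player_a", 8),
    ("baseline", "player_b", 7),
    ("no_addons", "player_a", 8),
    ("no_addons", "player_b", 7),
    ("no_healing", "player_a", 1),
    ("no_healing", "player_b", 2),
]


def mean_rating_by_condition(rows):
    """Group ratings by test condition and return the mean rating for each."""
    grouped = defaultdict(list)
    for condition, _player, rating in rows:
        if not 1 <= rating <= 10:
            raise ValueError(f"rating out of range: {rating}")
        grouped[condition].append(rating)
    return {condition: mean(values) for condition, values in grouped.items()}


summary = mean_rating_by_condition(ratings)

# Print conditions from least to most laggy.
for condition, avg in sorted(summary.items(), key=lambda item: item[1]):
    print(f"{condition}: {avg:.1f}")
```

A condition whose average drops sharply compared to the baseline is the one worth testing further, which is how we narrowed things down to healing.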

The Second Experiment:

Our initial testing was conducted at the beginning of our normal raid week (Wednesday from 8pm-10pm PST for these tests). Late Sunday night (~11pm PST) we attempted to reproduce the problem and were much less successful. We were able to see it during Heroism, but it was not as pronounced as on Wednesday, and the game was smooth at other times during healing. From this, we theorize that it is heavily affected by the load on the servers themselves, by network connectivity at the Blizzard data centers, or by both.

This means the lag likely occurs at peak raiding times (and probably most noticeably during the overlap of the East and West US coasts).

Conclusion:

Healing in 25s is very strongly implicated in causing the maligned "input lag" across the entire raid, made much worse during peak US raiding hours. While other factors may contribute in minor ways (composition, hybrid healing, etc), in our tests, healing or its absence dwarfed any other cause, and its absence left us with the silky smooth gameplay from previous expansions. We don't know what is happening inside the game client or on the WoW servers or on the network in between, but this is a huge, huge issue when it comes to player enjoyment. We expect it to get significantly worse in 5.4 when healers get their legendary cloaks and as gear scales.

What You Can Do:

Repeat our tests and see if you encounter the same problem. The easiest test is to simply do two fake raid pulls on the training dummies before Jin'rokh: one without healing, and one with healing. Take 10-15 minutes from your raid week, note the time, day, and battlegroup, and see if you get the same result. The more data we gather, the more easily Blizzard can hopefully fix the problem.
Community Manager
#25 - Aug. 19, 2013, 11:57 p.m.
Blizzard Post
This is some pretty awesome sleuthing here, great work! We've been doing some similar testing over the past couple of weeks and we agree -- it looks like healing is the culprit of the dreaded "raid lag" that many guilds have been experiencing.

We do have a few adjustments coming down the pipe for 5.4 that should help. We're changing AoE heals to no longer affect insignificant guardians like Voodoo Gnomes or Wild Imps. We're also making a few adjustments to Healing Rain, Light's Hammer, Spinning Crane Kick, and Holy Word: Sanctuary that should hopefully clean things up a bit as well. Plus, now that we know the source of the issue, it'll be easier to make further adjustments if necessary.

Thanks again for putting so much effort into testing this!
Game Designer
#163 - Oct. 4, 2013, 10:02 p.m.
Blizzard Post
This is an issue that I personally care about tremendously, and it’s no exaggeration to say it’s discussed nearly daily in some capacity, among members of the design team and with our gameplay and server engineers. We made a number of changes to spells in patch 5.4, tweaking periodic and area-of-effect heals like Healing Rain to massively reduce the number of discrete healing events yielded by each spellcast while keeping overall throughput largely similar. Since 5.4 was released, we’ve continued to make performance improvements to effects like Soul Link (batching up multiple damage events into a single heal to you and your pet rather than triggering a dozen tiny heals when you AoE a group of creatures), and our investigation continues. We know that it doesn’t matter how well-designed or fun a boss encounter is in the abstract if you have to struggle with basic control elements while facing it.
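
As a rough illustration of the batching idea, here is a toy Python sketch. It is not the game's actual implementation; the function names, the 25% heal fraction, and the damage numbers are all hypothetical. It only shows how summing damage first yields one heal event instead of one per target, with the same total healing.

```python
def heal_per_hit(damage_events, heal_fraction):
    """Unbatched: emit one heal event for every damage event."""
    return [damage * heal_fraction for damage in damage_events]


def heal_batched(damage_events, heal_fraction):
    """Batched: sum the damage first, then emit a single heal event."""
    if not damage_events:
        return []
    return [sum(damage_events) * heal_fraction]


# One AoE cast hitting twelve creatures (hypothetical numbers).
hits = [1000] * 12
HEAL_FRACTION = 0.25  # assumed value, chosen only for the example

unbatched = heal_per_hit(hits, HEAL_FRACTION)
batched = heal_batched(hits, HEAL_FRACTION)

print(len(unbatched), "heal events unbatched, total", sum(unbatched))
print(len(batched), "heal event batched, total", sum(batched))
```

In this toy model the total healing is identical either way; only the number of discrete events to process changes.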

Continued feedback and data from players who experience the specific type of input delays described in the excellent original post of this thread would be helpful and appreciated. But while I am no technical expert, it is important to note that “lag” can be a potentially misleading catch-all term that is used to describe a variety of different types of problems. The hallmark of the input delay issue described at the outset of this thread is that it occurs despite having consistently high FPS and low ping to the server. Some of the more recent reports of sharp FPS dips almost certainly have other causes. For example, for players reporting massive framerate drops when entering Norushen’s trials, the most likely cause lies with a UI mod that needs to be updated or replaced (likely a unitframe mod in particular). Other cases of poor framerate in the outdoor areas of the raid are more likely a video card or graphics settings issue. Those are valid concerns for feedback and discussion, but would be better directed to the Technical Support forums.

If you’d like to discuss the specific type of performance issue outlined at the start of this thread, including updated examples of Siege of Orgrimmar fights where the problem is most apparent, or whether there has been noticeable improvement since 5.4, that would be very welcome.