Your testing methods - Blue Tracker

#0 - Dec. 31, 2008, 7:51 a.m.

Can you enlighten us as to what your testing methods are when you "run your internal numbers"? Do they involve raid environments? Are there 24 other people testing other characters so you get a realistic idea of a typical raid environment? Do you test against mobs with typical raid boss characteristics? Do you test many different DPS rotations? What level of gear do you test with? Low, mid, high, all 3, and/or future levels? How much is in-game testing and how much is theorycrafted on spreadsheets?

Ghostcrawler

#14 - Jan. 1, 2009, 2:12 a.m.

Quote:
Can you enlighten us as to what your testing methods are when you "run your internal numbers"? Do they involve raid environments? Are there 24 other people testing other characters so you get a realistic idea of a typical raid environment? Do you test against mobs with typical raid boss characteristics? Do you test many different DPS rotations? What level of gear do you test with? Low, mid, high, all 3, and/or future levels? How much is in-game testing and how much is theorycrafted on spreadsheets?

We do all of those things. Typically, we start with a few back-of-the-envelope calculations and try it in game (kind of like you would do with a test dummy). If that seems like the right ballpark, then we do the calculation to figure out real numbers for the ability, talent, enchant, etc. in question. This is the "spreadsheet" step. Players deride it sometimes, but you can't balance a game the size of WoW by just guessing numbers that feel right. (On the other hand, part of being a good designer is also knowing when to ignore the numbers suggested by the spreadsheet.)

Then we do some more extensive solo testing with and without buffs, against targets of various levels, and at different levels of gear.

Then we do two different kinds of more robust testing. One is more functional testing -- making sure the ability does what it's supposed to and behaves as expected with various talents, buffs, glyphs, set bonuses etc. Simultaneously, we do playtesting, where we have someone who knows the class very well in a raid or PvP setting. Depending on the magnitude of the change and our testing resources, we will also have less skilled players try out the change to see how it performs for them.

Once it's officially in the game of course, it still isn't done. Players are going to have feedback on almost any change, especially after they've had days or weeks with it in place. Despite the scope of our testing, it can't compare to the 11.5 million WoW players out there.

Now, I skipped the whole part about agreeing upon a change, since the question was about testing. That typically involves a lot of brainstorming, usually in a group setting. Once we agree on a change (and we almost always agree -- I can't think of a single change we made where some designers were on board with a change and others had serious reservations), we will bounce it off of several people, both at Blizzard and among expert and casual players in the community.

When our testing is wrong, and that certainly does happen, it is usually for one of these reasons (and sometimes several at once):

1) Bugs inflated or deflated the numbers. Sometimes these are unknown bugs and sometimes they are bugs we know about but try and work around.
2) Players found a creative way to use an ability we didn't account for. This could be anything from a crazy talent spec, to stacking a particular stat, to a synergistic use of abilities from several different classes.
3) Closely related, sometimes we just underestimate player skill. This can be particularly challenging when we try to estimate how much better you can get with repetition. We might try something for a few hours, but a player who tries an ability for several months may get really good with timing, position or other skill-based factors.
4) Good old-fashioned math errors or other mistakes. These are rare, but they do happen.
5) Testing errors. Maybe our "expert" player wasn't as good as we thought. Maybe our sample size wasn't large enough. In a game the size of WoW you can't test every possible combination in a matrix (okay, now try a Draenei death knight blacksmith DW with Crusader vs. a night elf shadow priest... etc. etc.), and sometimes it's the strange stacking effects from different sources that end up being the biggest problem.
6) It's a big game. What I mean by that is our development has many moving parts. Maybe a new enchant was introduced after the testing that changes everything. Maybe a tweak to buff or nerf that enchant changes everything. Maybe fixing a bug way over here had unpredictable effects way over there. In an ideal situation, you lock everything down and test the entire game and every time you fix a bug or make a data change, you test everything again. Realistically that isn't possible.
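
To put a rough number on the "every possible combination in a matrix" problem from point 5, here is a minimal Python sketch. Every count in `dimensions`, and the `minutes_per_test` and `testers` figures, are made-up placeholders chosen for illustration, not real game data or real staffing numbers.

```python
from math import prod

# Hypothetical number of options per character dimension (illustrative only).
dimensions = {
    "race": 10,
    "class": 10,
    "talent_spec": 3,
    "profession_pair": 55,
    "weapon_enchant": 20,
    "gear_tier": 3,
}


def total_combinations(dims):
    """Size of the full test matrix: the product of all option counts."""
    return prod(dims.values())


def days_to_test(dims, minutes_per_test=5, testers=10):
    """Working days (8 hours each) needed to try every combination once."""
    total_minutes = total_combinations(dims) * minutes_per_test
    return total_minutes / testers / (60 * 8)


print(total_combinations(dimensions))  # single-character combinations
print(days_to_test(dimensions))        # working days for one full pass
```

Even with these small placeholder counts the matrix comes to roughly a million single-character setups, and a matchup between two characters squares that figure, which is why only a sample of the matrix can ever be tested.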

Also keep in mind that sometimes players assume something is a bug that is actually by design. I try to clear this up if there is a lot of confusion over a particular issue.

Ghostcrawler

#53 - Jan. 1, 2009, 10 p.m.

Quote:
WoW is a giant calculator. Nothing more, nothing less. That's why these spreadsheets work, and bugs notwithstanding, they accurately model the underlying mechanics so you can plug-and-play different gear, talents, professions, group compositions, boss armors, fight duration, debuffs on the boss, you name it, and discover how to get the most out of your class.

The spreadsheets work... to a point. You must keep in mind their shortcomings. These include, firstly, assumptions about how mechanics work that aren't always correct. Some mechanics are well understood. Others are assumptions, which while often based on large sample sizes, can still sometimes be wrong.

Second, they typically model "best possible" which includes very little lag, no movement and super-human timing. The more complex the class abilities, such as for hunters or warlocks, the harder it is for players to deliver on those best case scenarios. One of the frustrating things we have to deal with is when a player gets say 6500 dps out of a spreadsheet, can't do that on an actual boss fight, and then somehow blames us for the difference. Players forget that they can move out of range of buffs, that their pet is standing in the fire, that their trinket didn't proc as often as they expected, or that they died because they were so focused on their timers that they forgot to click that Health Stone.
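
The gap between a spreadsheet figure and a real fight can be sketched as a couple of multiplicative penalties. This is only an illustration: the function name, the two factors, and the 0.90 and 0.95 values are assumptions made up for the example, not how any actual spreadsheet or internal tool models it.

```python
def effective_dps(sheet_dps, time_on_target=1.0, execution=1.0):
    """Scale a best-case spreadsheet DPS figure by simple real-world penalties.

    time_on_target: fraction of the fight actually spent attacking
                    (movement, being out of range, dying early).
    execution:      fraction of the ideal rotation actually delivered
                    (lag, mistimed abilities, missed procs).
    Both factors are hypothetical simplifications in the range 0..1.
    """
    if not (0.0 <= time_on_target <= 1.0 and 0.0 <= execution <= 1.0):
        raise ValueError("factors must be between 0 and 1")
    return sheet_dps * time_on_target * execution


# The 6500 dps spreadsheet figure from the post, with assumed penalties.
realistic = effective_dps(6500, time_on_target=0.90, execution=0.95)
print(round(realistic, 1))
```

With those assumed penalties, a player who attacks 90% of the time at 95% efficiency lands well over 900 dps short of the sheet without any error in the underlying mechanics.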

Third, players can get different results, sometimes legitimately and sometimes disingenuously to prove a point. For every case such as the Elemental shaman tests where our numbers were inaccurate, you can find players arguing that, say, the Retribution or DK nerfs were "to the ground," when in fact those characters seem to be doing very well, perhaps even too well.

We do take player tests and estimates very seriously, particularly when we can verify the skill, knowledge and motivation of the player involved. At the end of the day, we are still going to rely on our internal testing to a large degree. You can have a tee hee moment about the times when we've gotten it wrong, because that certainly happens, but in the overwhelming majority of cases, our numbers tend to be right.

Ghostcrawler

#73 - Jan. 2, 2009, 7:35 a.m.

Adrine, well stated. It may look different from my side of things, but for every case where the community is spot on with something, there is another where they're just... not. A quick gander at the forums even tonight (and even the posts with numbers) can demonstrate that. I was there throughout beta, and it wasn't as if we were getting a laser beam focus on a few outstanding issues that we needed to address. We were bombarded with requests (and numbers) ranging from reasonable to ludicrous. I don't fault anyone for this because it's exactly what we asked for (and still do). While we can research a player's history and skill and double-check all their numbers, that would be a tremendous amount of work for the signal-to-noise ratio we usually experience. Now we are investigating some more options to get numbers from reliable sources. We do that already on a small scale, and it would be nice to expand it.

Without going line by line, I don't agree with everything you said. There were some numbers out of whack early on in BC and there were more quick changes to fix things. The high threat generated by tanks isn't something we missed -- it's exactly what we said we were going to do. And so on....

But overall, nice post.

Quote:
Don't let em forget this. I, too, saw all the spreadsheets and outcries on regular and Beta forums regarding all these issues that either went legitimately ignored or strewn aside. Shamans are pretty quick to point to how "bad" their class is but its happened in the past to a lot of other classes.

Yes, I believe all 10 classes were presenting us with evidence of how weak they were in PvP and PvE. Nobody was sitting back happy with their numbers and the couple of times I offered something like "Wow, I guess this class is in good shape," I was soundly corrected. :)

When we see a number of credible estimations that disagree with our numbers, that nearly always gets us to go back and recalculate ours. However, I will also offer that if you try to get your handle on the state of the game from reading forum posts, you're going to have a pretty skewed view of things.

Ghostcrawler

#75 - Jan. 2, 2009, 8:40 p.m.

Quote:
1) Do you and your co-workers use/review the spreadsheets/etc that get produced by the theorycrafting players?

Yes, we do. Sometimes we do it to help understand why there are discrepancies and sometimes we do it because we have characters too. :) I can think of at least two occasions where I mumbled something like "Hey, did you see how much priority this spreadsheet gives to [hit or whatever]?" and another designer said right back "Yes, that is a very credible spreadsheet."

Quote:
2) Are there any changes planned for how the combat log uses timestamps, how entries are ordered, and how buff refreshes are handled? (This is mostly because of the confusion that arose regarding Clearcasting uptimes, and thus how effective the Elemental Oath change is going to be)

I can't point to any specific changes, but I understand the issues it caused with Elemental clearcasting in particular, so maybe we can get something worked out.
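
As an illustration of why timestamps and entry order matter for uptime estimates, here is a small Python sketch. The log format (a list of `(timestamp, kind)` tuples) and the event names are hypothetical and are not the actual combat log format. The tie-break rule, which treats a removal and an application sharing a timestamp as a refresh, is likewise an assumption.

```python
# At equal timestamps, process removals first so that a remove/apply pair
# sharing a timestamp is read as a refresh rather than as the buff ending.
ORDER = {"removed": 0, "applied": 1, "refreshed": 1}


def buff_uptime(events, fight_start, fight_end):
    """Fraction of the fight during which the buff was active.

    events: iterable of (timestamp, kind) tuples, kind being one of
            "applied", "refreshed" or "removed" (hypothetical names),
            all assumed to fall within the fight.
    """
    active_since = None
    total = 0.0
    for ts, kind in sorted(events, key=lambda e: (e[0], ORDER[e[1]])):
        if kind == "removed":
            if active_since is not None:
                total += ts - active_since
                active_since = None
        elif active_since is None:
            # "applied" or "refreshed" while inactive starts a new interval.
            active_since = ts
    if active_since is not None:
        # Buff still active when the fight ends.
        total += fight_end - active_since
    return total / (fight_end - fight_start)


# The refresh at t=10 is logged as "applied" before "removed".
log = [(0.0, "applied"), (10.0, "applied"), (10.0, "removed"), (25.0, "removed")]
print(buff_uptime(log, fight_start=0.0, fight_end=50.0))
```

Read strictly in logged order, the same entries would end the buff at t=10 and give 20% uptime instead of 50%, which is the kind of discrepancy that makes uptime estimates from logs disagree.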

Quote:

[edit]PS: What are you doing back at work so soon after New Years? :p

Technically I am still recovering from that last coffee mug of 2008. (Forgive the slurred typing.) But I sat out of Naxx tonight in order to get a jumpstart on some of what the community thought were pressing issues for when we are back. I have a good list.