#14 - Jan. 1, 2009, 2:12 a.m.
Quote:
Can you enlighten us as to what your testing methods are when you "run your internal numbers"? Do they involve raid environments? Are there 24 other people testing other characters so you get a realistic idea of a typical raid environment? Do you test against mobs with typical raid boss characteristics? Do you test many different DPS rotations? What level of gear do you test with? Low, mid, high, all 3, and/or future levels? How much is in-game testing and how much is theorycrafted on spreadsheets?
We do all of those things. Typically, we start with a few back-of-the-envelope calculations and try it in game (kind of like you would do with a test dummy). If that seems like the right ballpark, then we do the calculation to figure out real numbers for the ability, talent, enchant, etc. in question. This is the "spreadsheet" step. Players deride it sometimes, but you can't balance a game the size of WoW by just guessing numbers that feel right. (On the other hand, part of being a good designer is also knowing when to ignore the numbers suggested by the spreadsheet.)
Then we do some more extensive solo testing with and without buffs, against targets of various levels, and at different levels of gear.
Then we do two different kinds of more robust testing. One is more functional testing -- making sure the ability does what it's supposed to and behaves as expected with various talents, buffs, glyphs, set bonuses etc. Simultaneously, we do playtesting, where we have someone who knows the class very well in a raid or PvP setting. Depending on the magnitude of the change and our testing resources, we will also have less skilled players try out the change to see how it performs for them.
Once it's officially in the game of course, it still isn't done. Players are going to have feedback on almost any change, especially after they've had days or weeks with it in place. Despite the scope of our testing, it can't compare to the 11.5 million WoW players out there.
Now, I skipped the whole part about agreeing upon a change, since the question was about testing. That typically involves a lot of brainstorming, usually in a group setting. Once we agree on a change (and we almost always agree -- I can't think of a single change we made where some designers were on board and others had serious reservations), we will bounce it off of several people, both at Blizzard and among expert and casual people in the community.
When our testing is wrong, and that certainly does happen, it is usually for one of these reasons (and sometimes several at once):
1) Bugs inflated or deflated the numbers. Sometimes these are unknown bugs and sometimes they are bugs we know about but try to work around.
2) Players found a creative way to use an ability we didn't account for. This could be anything from a crazy talent spec, to stacking a particular stat, to a synergistic use of abilities from several different classes.
3) Closely related, sometimes we just underestimate player skill. This can be particularly challenging when we try to estimate how much better you can get with repetition. We might try something for a few hours, but a player who tries an ability for several months may get really good with timing, position or other skill-based factors.
4) Good old-fashioned math errors or other mistakes. These are rare, but they do happen.
5) Testing errors. Maybe our "expert" player wasn't as good as we thought. Maybe our sample size wasn't large enough. In a game the size of WoW you can't test every possible combination in a matrix (okay, now try a Draenei death knight blacksmith DW with Crusader vs. a night elf shadow priest... etc. etc.), and sometimes it's the strange stacking effects from different sources that end up being the biggest problem.
6) It's a big game. What I mean by that is our development has many moving parts. Maybe a new enchant was introduced after the testing that changes everything. Maybe a tweak to buff or nerf that enchant changes everything. Maybe fixing a bug way over here had unpredictable effects way over there. In an ideal situation, you lock everything down and test the entire game, and every time you fix a bug or make a data change, you test everything again. Realistically that isn't possible.
Also keep in mind that sometimes players assume something is a bug that is actually by design. I try to clear this up if there is a lot of confusion over a particular issue.