After writing a few articles on using statistics to analyze computer systems I thought I should write down a simple rubric for evaluating studies found in the IT world. This is just a small set of the most common errors I find in performance analysis papers, capacity planning papers, and just about anything put out by the IT industry.
I’m begging all programmers, IT managers, testers, project managers, secretaries, CEOs, CIOs, CFOs, CAOs, CIA agents, FBI agents, and anyone else who reads a paper touting a product to go through this list and see how the paper compares. This “hit list” is more or less ordered by how severe the offense is, with the top three being an immediate dismissal of the paper as a load of crap.
Do your part to fight the menace of Marketing As Research by demanding these things from your vendors and your study writers.
You Might Be A Marketroid Zomboid Paper If…
The following is a list of bad signs to check a paper against. If you hit one and the paper is guilty, then either reject the paper outright or start asking the authors for the required information. If they’re full of it, they’ll give you lame excuses, and you’ll have your answer about their honesty and validity. But sometimes people just don’t know better, so don’t assume any malice initially.
- Paper is not authored by a specific person. A company has no problem putting its name on a paper. It has no reputation worth anything more than a 50 dollar filing for incorporation at the Department of Commerce downtown. You can’t corner a corporation at a conference and argue about the study. A person who’s a serious researcher would probably worry more about putting their name on a paper that could ruin their reputation. Without a single person or named group of people to blame for the study it’s harder to track down sources, evaluate qualifications, and confront people directly for their scientific opinions.
- Paper does not start off with a succinct question or set of questions to be answered which are exact and sufficient for the task. A question like, “Is Windows faster than Linux?” is simply wrong because its wording is leading and it is much too vague to measure well. A better question is, “Does the Windows SMB/CIFS networked file system transfer the same files at a better or worse rate than Linux SAMBA using the same hardware and network infrastructure?” Even that would most likely need further refinement into a set of testable questions. Also, notice my use of the word “same” over and over?
- Data used in the analysis is not available. People make mistakes. If you don’t have the data used in the study then you can’t catch the mistakes.
- Tools used in the experiment are not freely available. This includes the source to any analysis scripts, tools, or test harnesses. Freely available does not mean without copyright or open source, but at least available for evaluation. A study is just a marketing brochure if it says anything like, “You too can shell out 100k and use our tool to do this kind of analysis.”
- There is no mention of how the experiment is designed to avoid confounding from other influences. Confounding is when you intend to measure one thing, but a bunch of other things are actually mixed into the measurement as influences. A classic example is measuring the performance of two web application architectures, but also using different databases, hardware, operating systems, and network cards. Really question the study if it claims to test one thing, and then you see lots of other components being changed between the systems (or, no mention of whether they explicitly avoided changing things).
- Averages are given without a matching standard deviation. An average is just useless without a standard deviation or similar measurement of variance. I could cook up 10 data sets with nearly the exact same mean or median, but with wildly different performance behavior. Demand standard deviation at a minimum.
- Paper does not mention all of the terms: minimum, median, mean, maximum, standard deviation, and possibly range. These all have exact definitions and are easy to calculate. If they are missing then someone is either clueless, lazy, or a liar.
- Paper does not use the above terms correctly or calculates them incorrectly. Without the data you won’t know the second part, but these 6 statistical concepts are very simple to calculate and get right.
- Paper does not use some established statistical test to compare samples, such as Student’s t-test or the Wilcoxon test. If the data is normally distributed then they should use the t-test. The Wilcoxon test is used for data with bad outliers or which isn’t normally distributed. Without a test comparing the two samples it’s too hard to see if they truly are significantly different or not. There are also lots of other tests for various comparisons between sets of data, but these two are the most popular. If neither is used or even mentioned then something is wrong.
- Graphs use red and green. In American and many other western cultures red and green typically symbolize “bad” and “good” or “stop” and “go”. It really doesn’t mean anything, but people will place these negative or positive perceptions on the graphed entity without really realizing it. Smart people might not do this, but then again these studies are like phishing attacks. They aren’t written for smart people. Anyone using these two colors (or similar colors with negative meanings in a culture) is probably messing with the graphs and not letting the data speak for itself.
- Graph ranges and formats are seemingly arbitrarily chosen. It’s amazing how reworking and warping a graph can give your performance a major boost. If the graph axes are not equidistant and do not go up by a constant rate then someone’s cheating to make things look good. There’s a good concrete example of this at the end of the article.
- Sampling method and experimental design are not described as exactly as possible. Both of these are very standardized elements of any modern science, and there are huge books full of various designs you can choose from. There’s even software to help you pick a design and verify that your sampling method will work with that design.
- Granularity used for measurements is too coarse to properly detect significant changes. If someone is measuring microsecond processes but using whole seconds then they are just wrong. With today’s computers and high-resolution clocks there is just no excuse. It is only appropriate to report in fractions of seconds if those numbers are derived from some finer measurement.
- Graphs and data do not show a random process fitting some distribution. Despite what your Computer Science professor told you, computers behave randomly. They may calculate the same number if you run a program multiple times, but the time it takes to calculate that number varies randomly within a distribution. If you look at a graph and it’s nice and smooth and shows this really great perfectly fitting line then the data is probably bogus or measured incorrectly. Real systems do not behave so cleanly and should have randomness around a specified mean.
- Erratic behavior in the results is not given an assignable cause and eliminated either by removing the same data from all comparison sets or by re-running the test. The classic example is the initial ramp-up period all systems have. This initial period is usually very different from the system’s steady state operation, and the cause is directly assignable to system instability found at the beginning of all properly functioning systems. This means that it can be identified and eliminated from the data during analysis (but kept for people to see). It must also be done the same for all other data samples. If you drop the first 3 samples from one set, then you have to drop them from all others. It’s also good form to do the analysis with the data intact and then refine it. Ultimately, if the data is too erratic for just one system and not others, then the test for that system should be re-run.
- The paper does not use enumeration to discuss plans, processes, and designs. A classic way of hiding obviously stupid decisions is to bury them in dense paragraphs. If you take these paragraphs and convert them to bullet points or enumerated lists you can typically see very clearly if the process is poorly designed. Since it’s actually easier to write the process using enumeration and/or bullets, you should question a paper which does not do this. The author is most likely trying to hide something. This is of course subjective, since sometimes the experiment is so complex that it is really difficult to describe effectively with just a list. But that leads to my final point.
- The experiment, design, sampling, or other parts are not the simplest designs necessary to analyze the question at hand. This is again to avoid the “baffle them with bullshit” tactic. If the study is measuring complicated and big sounding things with complicated experimental designs then you should begin to worry. There is almost always a simpler measurement or experiment that would do better. Words like “user” are typically just a smoke-screen for a simpler measurement that makes the author’s view look bad.
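To make the averages and summary-terms points above concrete, here’s a minimal Python sketch using only the standard library. The two samples are made-up numbers, not real measurements: they share the same mean, and only the standard deviation (and range) reveals how differently they behave.

```python
import statistics

# Made-up response-time samples (seconds); same mean, wildly
# different behavior -- which only the spread measurements reveal.
steady = [0.98, 1.01, 0.99, 1.02, 1.00, 1.00, 0.99, 1.01]
erratic = [0.10, 1.90, 0.20, 1.80, 0.15, 1.85, 1.00, 1.00]

def summarize(sample):
    """The six summary terms every performance paper should report."""
    return {
        "minimum": min(sample),
        "median": statistics.median(sample),
        "mean": statistics.mean(sample),
        "maximum": max(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
        "range": max(sample) - min(sample),
    }

for name, sample in [("steady", steady), ("erratic", erratic)]:
    print(name, {k: round(v, 3) for k, v in summarize(sample).items()})
```

Both samples average out to 1.0 seconds, so a paper reporting only the mean would call them identical; the standard deviations differ by a factor of about 60.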
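The statistical-test point above can also be sketched in a few lines. This is a bare-bones version of Welch’s t statistic (a t-test variant that tolerates unequal variances) built from the standard library; the throughput numbers are invented, and for real work you’d use a stats package such as scipy.stats.ttest_ind, which also gives you the p-value and has Wilcoxon-style alternatives.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples with possibly
    unequal variances. A large |t| suggests the means really differ;
    a full test compares it against the t distribution for a p-value."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Invented throughput samples (MB/s) from two hypothetical file servers.
server_a = [41.2, 40.8, 41.5, 41.0, 40.9, 41.3]
server_b = [39.1, 39.5, 38.9, 39.2, 39.4, 39.0]

print("t =", round(welch_t(server_a, server_b), 2))
```

A paper that shows two bars of different heights but never runs anything like this has not demonstrated a significant difference; it has demonstrated two bars.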
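The granularity point above is easy to demonstrate: Python’s nanosecond-resolution clock makes fine-grained timing a one-liner, so there really is no excuse. The workload here is a stand-in for whatever microsecond-scale operation is being measured.

```python
import statistics
import time

def time_once_ns(fn):
    """Time one call with a nanosecond-resolution clock. Measuring a
    microsecond-scale operation with a whole-seconds clock would just
    report 0 and hide any real difference between systems."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start

def work():
    # Stand-in workload that finishes in microseconds.
    return sum(range(1000))

samples = [time_once_ns(work) for _ in range(100)]
print("mean:", round(statistics.mean(samples), 1), "ns")
print("stdev:", round(statistics.stdev(samples), 1), "ns")
```

Notice that the repeated runs will not all take the same time: the spread around the mean is exactly the randomness a believable study should show.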
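Finally, the ramp-up trimming point above can be sketched like this. The latency runs are invented; the rule being illustrated is that you trim the same number of initial samples from every comparison set, with the raw data kept around for readers to see.

```python
import statistics

WARMUP = 3  # assignable cause: startup instability in the first runs

# Invented latency runs (seconds); the first few samples show the
# ramp-up every system has before reaching steady state.
raw = {
    "system_a": [9.1, 5.2, 3.3, 1.1, 1.0, 1.2, 1.1, 1.0],
    "system_b": [8.7, 4.9, 2.8, 1.4, 1.3, 1.5, 1.4, 1.3],
}

# Analyze the raw data first, then trim -- and trim every set
# identically, never just the one that makes your side look bad.
trimmed = {name: data[WARMUP:] for name, data in raw.items()}

for name in raw:
    print(name,
          "raw mean:", round(statistics.mean(raw[name]), 2),
          "steady mean:", round(statistics.mean(trimmed[name]), 2))
```

Dropping the warm-up from only the system you favor is exactly the kind of massaging this checklist is meant to catch.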
Who The Hell Am I?
Who am I to say that this is a basic set of requirements for an analysis study? Well, I’m not getting these requirements from some arbitrary place, but from many books on properly displaying statistical data and graphical information. Read any book by Edward Tufte on displaying information, and just about any book on statistics for giving accurate information.
In addition, I’ve developed this list after years of reading, writing, and picking apart studies with these problems. I’ve even read entire books on so-called “performance tuning” which violate all of these basic things. I’m just blown away by how someone can write an entire book or article on performance tuning and not mention standard deviation once, or show one run chart with a mean and +/- standard deviation lines.
A Concrete Example Of Graph Play
Problems with graphs are better shown than described. Here are two examples from Microsoft Windows Server 2003 Standard Edition vs. Samba 3.0 and Red Hat Enterprise Linux ES 3.0 File Server Performance Comparison written by VeriTest (and showing them here is covered under fair use):
Bad Graph 1
Bad Graph 2
There are loads of problems with these graphs, but notice how the Linux lines are red? Believe it or not, this is probably done on purpose in some lame attempt to fool you into thinking Linux is bad. Even worse, it works about as well as pricing soap at $1.95 instead of $2.00 to fool people into thinking it’s cheaper.
Also look at the axes and their layout. The first graph has the y-axis (left side) going up in increments of 50, while the second graph’s y-axis goes up in increments of 100. This makes the two sets of results look the same, when they actually look very different graphed on matching scales. What’s worse is that the x-axis for both graphs is the same, which means they changed one scale (y-axis) without adjusting the other (x-axis). This creates a distorted graph.
Next, take a look at the x-axis and how it’s this weird 4 client step, but that the data points are between the “ticks” (little lines). This makes it harder to line up the x-axis with the y-axis and get an accurate view of each data point. Instead, the data points kind of float in space. This adds even more visual distortion making it easier to massage the graphs into submission.
These graphs violate almost every principle I’ve read about good graphs. I can understand a few sticking points, but these seem almost deliberately modified to give a good pretty curve that’s hiding something.
As I mentioned before, I would love for everyone in the IT industry to start demanding these things from the studies they read. If you run into a study that you can share with me, and which you think I might want to compare against this list, then send it on to me.