|
![]() |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Ruby/Odeum vs. Lucene Part 2This article attempts to practice what I preach in Programmers Need To Learn Statistics Or I Will Kill Them All by doing my original Ruby/Odeum vs. Lucene analysis in a more formal way. There were several complaints from the Lucene crowd which I try to address, and I also bring up problems with suggested improvements from several other people. I’ll be using the R programming language to do the analysis and produce the graphs. Also, this essay won’t be as funny as my rant (if it was funny that is). Problems With Part 1People mentioned a few problems with my first performance analysis of Ruby/Odeum and Lucene. If you read that essay I did cover these as weak points:
Executive SummaryFor the people who have no clue (also known as “Executives”) here’s the information you need to tell all your employees they need to adopt the latest and greatest thing without ever having to understand anything you read. Cheaper than an article in CIO magazine and even has big words like “standard deviation”.
My Performance Analysis ProcessI’m going to give my method for conducting an experiment to determine the performance of a system. It is based on your typical scientific method, but I’ve adapted it slightly to accommodate how a computer works. With a computer I have the advantage that I can pretty much sample as much as I want from any particular component. In a biological experiment or a manufacturing control sampling you’ll end up destroying the items you sample so you’re limited to random selection. My process is pretty simple:
You’ll notice that I don’t mention developing a hypothesis in this process. The use of hypothesis testing in statistical analysis is actually a little controversial, with the majority of the scientific community using a hypothesis method, and folks in statistical process control avoiding them. The argument against using an hypothesis is that it will influence your experimental design and doesn’t really prove anything. The argument for using an hypothesis is that you need to have something to prove otherwise you can’t make a valid statement in your analysis. I’m more practical than either of these view points. I don’t use a hypothesis if I’m simply comparing the performance of some systems or gathering data to find out what is going on. I do use a hypothesis if I’m trying to improve the performance of a system. I’ll simply say, “My hypothesis is that this change will be a statistically significant improvement in performance over the previous version.” I’ll also develop an hypothesis when someone else makes a claim that their system is better than another one, but this is again implied in the two data sets. Research PlanThe big question I want to study is, “How do Ruby/Odeum, Odeum, and Lucene compare in querying performance and memory consumption?” That’s pretty much the gist of it, but this isn’t really exact enough for our purpose. This question is much too vague to build up a decent experimental design. What we need to do is refine the question into very specific variables that we can measure and test in a consistent and clear way. Refining The QuestionThere’s actually a whole process you can follow to refine a hypothesis or research question, and it’s remarkably similar to the process of doing an object-oriented domain decomposition. If you’re really interested read Methods of Social Research Bailey. It’s written for social science students, but it has a very good explanation of developing this kind of question. I’ll just cut to the chase and layout the refinement before people get even more bored. What we really want to know is:
Including the direct QDBM Odeum (which is what Ruby/Odeum wraps) lets me see what is possible with Ruby/Odeum, and also lets me get a kind of baseline performance since QDBM Odeum is written in pure C. This leaves us with two variables we’re analyzing:
Finally, I have a few assumptions and constraints I’m going to follow:
Experimental SetupMy system configuration is the same as the previous informal study and only the test process was altered. The system is:
Initial Data GatheringOne of the biggest mistakes people make (and companies make on purpose) is to not verify that their sample sizes are good enough to support their accuracy requirements. Not taking enough measurements is like trying to measure microseconds with 16-bit integers. You could get lucky and have microseconds that fit in the chosen 16 bits, but you have no idea if this is true or not. Using an arbitrary sample size for your performance measurements is similar since you could choose way to many or way too few. There is a chicken-and-eggs problem though since you need to have some data in order to make sure your data is valid. normally you would make an educated guess about the sample size or run a few simulations. My plan is to do a small set of 10 sample runs of 100 each, and then use that data to figure out an optimal set of sample runs and counts. I’ll take the samples, do a few useful graphs of each the sample means, and then use a power test to verify that I have a good sample size. Statistical Power TestThe graphs and data for this initial run are available in a separate data directory for people to review. The main thing is that if I do a boxplot of the three time samples I can see that Lucene is going to end up being the one with the widest range.
What this graph shows is that Lucene has a slightly wider range (it’s box is fatter), so when I do my statistical power test the Lucene runs will require more than the other systems. To do my power test I’m using the power.t.test function in R. This function takes information about the kind of accuracy you want and the standard deviation you expect, and then tells you the number of samples you’ll need per sample run. > lupower <- power.t.test(delta=0.001, sd=sd(lut$time), sig.level=0.01, power=0.9) > odpower <- power.t.test(delta=0.001, sd=sd(odt$time), sig.level=0.01, power=0.9) > ropower <- power.t.test(delta=0.001, sd=sd(rot$time), sig.level=0.01, power=0.9) > c(lupower$n, odpower$n, ropower$n) [1] 4081.330 1424.565 1706.061 Looks like Lucene’s run needs a lot more than the others, so we’ll end up having to use Lucene’s power requirements. An additional concern is that the data is not really normally distributed, which sucks because power.t.test is intended for that distribution. A general rule of thumb is that you can multiple the test results by 1.3 and cover yourself for any other distribution. I’ll just do this since computing is cheap and I can run the test over night. Our sample size will be 5306 samples, but let’s just say 5500 for good measure. We’ve now got our sample sizes, but how many actual samples (runs) do we need? We can again use the power.t.test on the “sample means” to figure out how many we need. What’s a sample mean? The theory of sampling says that if we do a bunch of independent samples, and then take something like the mean or median, then this series of measurements will be normally distributed no matter how the actual data is distributed. In other words, if I just do my 5500 samples once then I have only one sample and one mean. If I repeat this 5500 say 10 more times, take the mean of each sample run, then I have 10 sample means. These sample means will be normally distributed as I take more of them (closer to 30). I’m really only interested in the sample means in this analysis. The blue line is the mean, and the red line is two standard deviations from the mean.
A graph like this will let you easily see the ramp-up period prior to reaching the steady state of the system. It would not show you possible performance problems from assignable causes. For that you’d want to use this chart to find the steady state, then randomly pick one of the sample runs and do a complete run chart of all the samples. Anything that is outside the red lines (two standard deviations) is suspect and you should check it out. If we run the power.t.test on this data then we can get an idea of how many sample runs we’ll have to do to get good accuracy:
> power.t.test(delta=0.001, sd=sd(lus), sig.level=0.01, power=0.9)
Two-sample t test power calculation
n = 125.6942
delta = 0.001
sd = 0.002041483
sig.level = 0.01
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
Alright, that’s ridiculous, but that’s what the math seems to support. According to this we’ll need to run 69300 total timing tests to get even close to the accuracy we want and Ruby/Odeum being the slowest takes 0.032 seconds for each query. That means a quick calculation: > mean(ros) * 126 * 5500 / 60 / 60 [1] 6.179075 This is the mean time of Ruby/Odeum, times the number of sample runs, time the number of samples, then divided by 60 / 60 to get the number of hours. And it’ll take 6.18 hours to run this test for just Ruby/Odeum. That’ll take forever probably pushing 12 hours with the other two runs included. Back To RealityWhat if we don’t have 12 hours to run this test? In our case we have a couple of choices:
If I have a choice, I prefer to reduce the size of the runs but try to do as many runs as I can. We’ll do this by accepting a different set of acceptable accuracy. I re-ran the power.t.test functions with delta=0.002 and sig.level=0.05 and came up with 22 sample runs of 721 each. This should take about half an hour to run. I actually prefer to run the longer experiment in this case so that people from the Java camp don’t complain about JIT compilation and Hotspot usage and other Java start-up annoyances. As you’ll see later this is a good choice since it demonstrates the ramp-up period really well. Second Stage Data GatheringThe second stage data collection was run using a set of shell scripts and then an analysis R script. The shell scripts just run each of the different programs in a consistent way. It’s harder to run your tests consistently if you don’t script them. The gist of the analysis is summarized in the following tables.
And some ratios comparing the mean and standard deviation (SD) for the different systems:
These tables are actually very misleading for a reason I’ll cover in the Results section. Simply take them as the raw data results which we’ll refine later in the Results section. ResultsThe first thing that I have to do is graph these three so I can see if there’s anything wrong with their data. If I do a simple comparison chart of the sample means you can see the first major problem:
Right away we have a ramp-up problem which is making the Lucene results look much worse than they actually are compared to the other two. Simply removing the first 2 runs would get rid of the ramp-up for all three and give us a better idea of the steady-state performance.
This also changes our previous two comparison tables, summary of statistical attributes:
And the ratios comparing the mean and standard deviation (SD) for the different systems:
That changes the standard deviation (SD) for Lucene in a big way, making it actually better than Ruby/Odeum. If we include the ramp-up period then it looks like Lucene is all over the map and has really poorly behaving performance. What I’ve done here is slightly modified our question to say we only are interested in the data collected after any ramp-up period. Since our comparison graph shows that Lucene has a huge ramp-up of nearly 2 full sample runs (2 sets of 5500 queries each) we get a better result by dropping these as if they were outlier data. Is this a fair thing to do though? Having a system that doesn’t start performing until it’s done some 11000 queries may seem a little ridiculous, but according to this data that’s about 3 minutes of constant pounding. The other two systems (Ruby/Odeum and Odeum) get up and running at a steady state right away, and what’s interesting is their performance gets slight worse after the first run (see how the Minimum is higher in the second set of tables?). Overall I prefer to show both sets of data and let people decide for themselves. If you’re running Lucene in short bursts (such as in a command line tool) then this ramp-up period is probably the only performance you’ll ever see. If you run it as a server than you’ll see the steady-state performance after about 3 minutes of constant activity. Memory ResultsMemory is kind of strange on most systems since there’s not too clear of an idea of what is considered actually used memory vs. just what the operating system has assigned to that process. Many operating systems will keep recently freed memory around in order to hand it back to the process for more usage. Typically you’ll see the memory increase for a while and then steady out. Finally, memory for runs like this should stay fairly constant over time. Our memory data shows pretty much this behavior, and gives a shocking memory result (which I’ll explore more to rule it out as a problem).
This is the VSZ + RSS (in kbytes) sampled at 1 second intervals while each test ran. The SD for these isn’t really that big of a deal. It would matter if we saw a really huge SD for one of them which would imply a really bad memory usage pattern I’d want fixed before I did more comparisons. I received a few comments from Erik Hatcher (author of Lucene in Action a fine book I’ve been inclined to buy) about the memory results for Lucene bringing up some issues. A few other people from Lucene also complained about the memory part (and so far really only the memory part). Even some folks going so far as to say “he still doesn’t have a clue.” Also people said that all I did was prove that the default memory settings for the chosen JVM were bad. I decided to do a deeper analysis of the memory consumption that Lucene produces. Let’s just test out this assumption that these memory usage ratings are based on the default setting for -Xmx and -Xms. Hell, let’s even test out as many different JVMs as possible. I re-ran the tests with these changes:
Doing this produced the following table of memory consumptions by JVM.
This table doesn’t show the mixture of -Xmx and -Xms settings, but this wonderful coplot sure does:
You read the coplot by finding the two gray bars on the top and right side for -Xms and -Xmx, and then find where they intersect in the center. That is then the graph of JVM by total memory consumption for those two settings of -Xms and -Xmx. If a JVM is not listed on that panel then it didn’t have a measurement. This is because of the JVM aborting due to a setting it considered invalid. Some JVMs also reported an error for the minimum setting and then adjusted it to the maximum setting, but this didn’t seem to change the memory usage levels when I analyzed it. In the end, after all this work, what we show is that most JVMs are much lower on the memory consumption scale the original Sun JDK. Sun’s JVM is just horrendous and a total pig, but even still the best memory consumption is still at 26570k. This absolute lowest possible memory consumption is still at least 3.6 times the Ruby/Odeum levels. Another point I’d like to make (which seems to get lost on people who read this part of the study) is that I really don’t consider this Lucene’s problem. This is a classic JVM issue that’s been around forever. No matter how you slice it Java is a fat ugly ass memory hog. This really demonstrates the validity of how I’m doing this experiment. With other performance analysis reports you’d have to kill someone to get at their data and source code. This makes it harder to prove alternative test cases or that the author has run the experiment fairly and accurately. With this experiment everything is open so folks like Erik who know way more about Lucene than I do are able to take the code and simply show me I’m wrong. If he can get the memory consumption down then he’s done and I can re-post the results. No need to argue, the data wins out every time. It’s weird how 3 and 6 keep showing up in this analysis. I swear I’m not fixing the data. ConclusionsThe initial conclusions are simply summarized as:
This information is supported by the data so I won’t go into it much. I do have some additional observations which I’ll call my “opinions” rather than an inference. I prefer to let people take the data and scripts I’ve written and do their own analysis. I also encourage people to suggest improvements and expand on this analysis. OpinionsRuby/Odeum’s performance compared to the QDBM Odeum shows that the Ruby binding is cutting the performance in half. I can think of a couple of reasons why this could happen, but the primary one I plan to investigate is how the query function returns all of the results as a large Ruby array. This involves a lot of data copying and the each operator might not be the best performing way to do this. An alternative is to have a simple iterator setup that gets the documents directly, pushing the iteration of document IDs back to the C interface. I’ll actually be testing this in the next release to see if it improves performance enough to justify the change to the API. Because Ruby/Odeum’s standard deviation (SD) after reaching steady-state is higher than the other systems, I predict that it will not perform as well under a heavier load. My experience says that a system with a high SD during testing will degrade much quicker when the load is increased. This is just a hunch, but I think a good test would be to make a green threads version of the Lucene system and then run Ruby/Odeum and Lucene tests again with a large number of threads. The reason I’d use green threads in Lucene is that’s the closest thing to Ruby’s threading system. The QDBM Odeum performance is so tight that I wouldn’t even bother with this test. In general I’m very impressed with Lucene’s performance, and also very happy with Ruby/Odeum’s performance. Considering that QDBM Odeum is faster than Lucene there’s hope that Ruby/Odeum can become even more competitive with some additional work. Still, I feel like this is arguing over nothing since we’re talking tenths of a second differences to search 18174 documents for a single word. My overall recommendation is that Ruby/Odeum seems to compete reasonably well with Lucene, and has advantages in memory footprint. I caution people to not take these results and assume that these systems will perform the same when integrated with their software. You should re-run these same tests with your own harness if you have concerns about the performance of these systems in new situations. Otherwise, you can probably go with this data and be reasonably secure that Ruby/Odeum or Lucene is fast enough for most of your search needs. Available Code & DataEverything I used is available for others to run, and all of the tools I used are free. You are free to take these programs and use them for your reference, copy them, send them around whatever. I donate them to the public domain unless someone else had a license first (like odmgr.c and SearchFiles.java). You’ll notice this is different from most other “performance studies”. Other studies hide the data and results from you in order to sell you something. I want you to make an educated decision and to improve the analysis to suit your needs. You should start demanding this information from your vendors and study authors as without you can’t verify that they weren’t cheating.
CreditsI’d like to thank the following people for reviewing this article for accuracy, grammar, and spelling: |