January 28th, 2011
The previous post described the main testing mechanism of WebKit: layout tests. While it’s a (deceptively) simple yet powerful system, it’s not without its limitations. This post attempts to list some of the issues and areas for improvement.
Golden (expected) files
Each test has one or more “expected” output text or image files checked in alongside the test itself. For simple tests that assert some behavior, a basic text file with the “PASS” output is enough. However, for more complex tests, especially those that verify rendering behavior, the expected file is an image, and optionally a text dump of the render tree. Despite doing everything possible to ensure consistent output (always rendering an 800×600 image, using the same color space), the images can vary. For example, if the output has text (which gets anti-aliased), or form controls that have a platform-specific appearance, then the various platforms that WebKit is available on (Mac, GTK, Qt, etc.) will each need to have a different golden file.
There is a mechanism for handling per-platform expectations, but it means that anyone adding or modifying tests may need to create multiple golden files, often for platforms that they don’t have access to. What usually happens is that changes are made with the expectation that tests will fail, and then when builders go red, new results are grabbed from them and used to update the checked-in expectations (this is a process known as rebaselining, and there are tools to help).
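To make the per-platform lookup concrete, here is a minimal sketch of how a runner might pick a golden file: try platform-specific locations first, then fall back to the one checked in next to the test. The `platform/<name>/` directory layout, the `-expected.txt` suffix, and the function name `find_expected` are assumptions made for illustration, not a description of the actual framework code.

```python
from pathlib import PurePosixPath


def find_expected(test_path, platform_order, existing_files, suffix="-expected.txt"):
    """Return the first expected file that exists, or None.

    test_path      -- test file relative to the test root, e.g. "fast/forms/button.html"
    platform_order -- platform names to try, most specific first (hypothetical)
    existing_files -- set of relative paths standing in for the checkout on disk
    """
    test = PurePosixPath(test_path)
    expected_name = test.stem + suffix

    # Platform-specific candidates come first (assumed layout: platform/<name>/...).
    candidates = [
        PurePosixPath("platform") / platform / test.parent / expected_name
        for platform in platform_order
    ]
    # The generic golden file lives alongside the test itself.
    candidates.append(test.parent / expected_name)

    for candidate in candidates:
        if str(candidate) in existing_files:
            return str(candidate)
    return None
```

With this shape, a platform that has no override of its own silently picks up the generic result, which is exactly why a change made on one platform can turn another platform's builder red.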
One solution to this problem is “ref(erence) tests”, a concept borrowed from Mozilla. Instead of checking a pixel golden file, another HTML file is checked that attempts to arrive at the same result via a different (simpler, known to work) path. For example, if testing complex CSS float handling, the reference file would construct the same (pixel for pixel) layout using absolute positioning, which is (hopefully) an orthogonal codepath. Both files can be rendered to an image and compared, without having to worry about platform-specific output. Hayato Ito has been working on adding reftest support to the WebKit testing framework.
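The core of a reftest can be sketched in a few lines: render both files with the same engine on the same platform, then require the two images to be identical. In the sketch below an image is just a list of rows of RGB tuples, and `render` is a hypothetical callable that stands in for whatever actually produces the pixels; neither reflects the real framework's interfaces.

```python
def count_differing_pixels(actual, reference):
    """Count pixels that differ between two images (lists of rows of RGB tuples)."""
    same_height = len(actual) == len(reference)
    same_widths = same_height and all(
        len(row_a) == len(row_r) for row_a, row_r in zip(actual, reference)
    )
    if not same_widths:
        raise ValueError("images have different dimensions")
    return sum(
        pixel_a != pixel_r
        for row_a, row_r in zip(actual, reference)
        for pixel_a, pixel_r in zip(row_a, row_r)
    )


def run_reftest(render, test_html, reference_html):
    """Pass only if the test and its reference render to identical pixels.

    `render` is a hypothetical function mapping an HTML path to an image.
    """
    return count_differing_pixels(render(test_html), render(reference_html)) == 0
```

Because both images come from the same platform, anti-aliasing and native form controls affect them equally, so no golden image has to be checked in.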
Having trustworthy tests provides both peace of mind (“was that a cosmic ray, or did I break something?”) and a smoother development experience (it’s no fun waiting for the commit queue to retry your patch because it ran into a flaky test). Flaky tests affect other projects too, and are perhaps an unavoidable problem in complex projects. In layout tests, they are most often caused by the use of delays (e.g. setTimeout) that become brittle when test conditions change.
Julie Parent and Ojan Vafai had a flakiness crusade of sorts last year which helped a lot with this, but more help is always appreciated. Adam Barth and Eric Seidel have started to keep track of flaky tests and have the commit queue report them, and the Chromium WebKit port has a dashboard of flaky tests (in case you find it baffling, this page explains how to interpret it).
A sub-category of test flakiness is caused by test interdependence: some tests will pass when run alone, but will fail when run as part of the whole suite (or vice-versa). Tests are not entirely isolated: the binary that they run in (DumpRenderTree) is only restarted every 1,000 tests for the sake of performance, and though some things are reset between each test, it’s not feasible to ensure a complete tabula rasa. Sometimes this is caused by obvious things, like usage of sessionStorage that is not cleaned up.
Other times, the interactions are much more subtle. For example, a patch that re-ordered some HTTP-level tests caused some entirely unrelated SVG tests to fail. It turned out that the reordering changed the chunking of tests (into groups of 1,000) and one test was triggering different kerning behavior in another due to overly coarse caching of some font rendering attributes (that has since been fixed).
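A toy model shows why a change to run order can matter. If the ordered test list is cut into fixed-size groups and each group shares one process, then moving a test shifts the group boundaries, and with them the set of tests that ran earlier in the same process. The function names and the tiny group size below are made up for illustration; the real runner is more involved.

```python
def chunk(tests, size=1000):
    """Split the ordered test list into groups that each share one process."""
    return [tests[i:i + size] for i in range(0, len(tests), size)]


def earlier_in_same_process(tests, target, size=1000):
    """Return the tests that run before `target` within its group."""
    for group in chunk(tests, size):
        if target in group:
            return group[:group.index(target)]
    raise ValueError(f"{target!r} is not in the test list")


# Hypothetical test names, with a group size of 2 to keep the example small.
before = ["http/a", "http/b", "svg/font", "svg/text"]
after = ["http/b", "svg/font", "svg/text", "http/a"]  # one HTTP test moved

print(earlier_in_same_process(before, "svg/text", size=2))  # ['svg/font']
print(earlier_in_same_process(after, "svg/text", size=2))   # []
```

In the first ordering `svg/text` inherits whatever state `svg/font` left behind; in the second it starts in a fresh process. Neither SVG test changed, yet the environment one of them runs in did.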
One would think that with over 20,000 tests, coverage would be good. However, given the billions and billions of web pages out there, it’s a somewhat common occurrence to break seemingly “obvious” things. I have personally broken the back button on both Google and Facebook, and the set of regression bugs shows that I’m not alone. Thankfully most of these are caught via nightly builds, and bug fixes always come with a test of their own, so one can only hope that things are improving.
While layout tests aren’t all rainbows, puppies, and sunshine, they are an important part of the WebKit project. The web is an ever-evolving creature, and they help us code fearlessly. If any of the challenges presented in this post tickle your interest, layout tests are a great way to get involved in the project (for example, if overly long C++ compiles are not your cup of tea, you may like working on the Python framework that runs the tests instead).