Jach's personal blog


Some thoughts on test automation

I found an old piece of paper from ~2014 that had some bullet points on testing philosophy I had back then, mostly to help myself articulate things in interviews. Reading through the list I still believe a lot of them, but others not so much or I have more qualifications. Let's look at them, elaborate, and add some thoughts not present then.

* Code should be designed for testing -- create hooks/test points (if hardware) if necessary. Want to facilitate "white-box" testing.

Mostly makes sense to me. This has larger implications, though, depending on the language, the test type, and how exactly you're designing for tests. The very worst design idea is to have code like "isRunningTests()", and to make the worst worse, to have code inside such blocks doing things like "getTestContext().getProperty("blah")" in place of whatever the real thing was.

In Java, this typically means you want to avoid static methods, unless such methods are purely functional. This is because Java makes it hard to mock static things. You want to avoid using tools like PowerMock, especially if you own the code under test.

Designing with mock-avoiding unit tests in mind not only helps test efforts but can often make the code clearer.
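
As a rough illustration of designing for tests without an "isRunningTests()" branch, here is a minimal Python sketch. The SessionToken class and its clock parameter are hypothetical names invented for this example; the idea is just that the dependency gets passed in, so the test supplies its own and the production code never needs to know tests exist.

```python
import time
from typing import Callable


class SessionToken:
    """Hypothetical token that expires after a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        # The clock is injected instead of calling time.time() directly, so a
        # test can supply its own without any "am I in a test?" check.
        self._clock = clock
        self._expires_at = clock() + ttl_seconds

    def is_expired(self) -> bool:
        return self._clock() >= self._expires_at


def test_token_expires_after_ttl():
    now = [1000.0]                      # mutable fake time
    token = SessionToken(ttl_seconds=30, clock=lambda: now[0])
    now[0] += 31                        # advance the fake clock past the TTL
    assert token.is_expired()
```

No mocking library is involved; the fake clock is a plain lambda.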

* Tests have bugs too

One of the arguments for testing at all is that it's hard to write working code so you need tests. But tests are code too -- do the same criteria not apply? How do we avoid an infinite loop of testing our tests, and testing our tests that test tests, etc.?

I mostly think of this as a counter to test-all-the-things mindsets. Lots of quality software has been made with 0 test automation, and lots of shitty software has been made with lots of test automation. Testing isn't a silver bullet.

It does argue for tests to be non-clever, though. It's harder to mess up if you're writing the same sort of code over and over, to some set pattern. Hence the small discipline of organizing each test into three phases (Arrange-Act-Assert, Given-When-Then, Setup-Execute-Verify, or whatever other name you prefer) and trying for one assert per test. This not only makes it easy to see what the test code is doing; by limiting the expressiveness of the test programs we can also write them more reliably, or at least the bugs they still inevitably have will tend to be simple.
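
A small sketch of the three-phase, one-assert pattern in Python. The apply_discount function is a made-up example, not from any real codebase.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test; the discount amount is rounded down."""
    return price_cents - (price_cents * percent) // 100


def test_ten_percent_discount_on_round_price():
    # Arrange
    price_cents = 2000
    # Act
    result = apply_discount(price_cents, percent=10)
    # Assert
    assert result == 1800


def test_zero_percent_discount_leaves_price_unchanged():
    # Same three phases, same shape, different case.
    price_cents = 1999
    result = apply_discount(price_cents, percent=0)
    assert result == 1999
```

Both tests are boring and nearly identical in shape, which is the point: there is very little room for a bug in the test itself.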

* Low-risk tests have low payoff (cost-benefit analysis, long term vs short term)

I somewhat still think this. If you write a test and 5 years later that test has never failed, what is its value exactly? Is it even possible for the test to fail? Such tests impose a cost on the test automation infrastructure; what exactly is their benefit?

There is still some value in providing coverage -- the code satisfies the test's assertions and continues to do so. And there is value in acting as a regression guard. Especially after 5 years, if someone goes and touches stuff there for the first time in that long, it's quite useful to have some tests encoding the expected behavior so that the new changes don't inadvertently break everything that was previously stable. When their costs begin to creep, one solution is to delete them, but it's probably better to just run them at different cadences (e.g. every N checkins instead of every checkin, and, even more awesome, on failure do a binary search to find the checkin where it started failing).
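
The binary search over checkins could look roughly like this. It is a sketch under assumptions: the passes callback (hypothetical) stands in for "check out this checkin and run the test", and the search assumes that once the test starts failing it keeps failing, which is the same assumption `git bisect` makes.

```python
from typing import Callable, Sequence


def first_failing_checkin(checkins: Sequence[str],
                          passes: Callable[[str], bool]) -> str:
    """Return the earliest checkin at which the test fails.

    Assumes checkins[0] passes, checkins[-1] fails, and that there is a
    single pass-to-fail transition somewhere in between.
    """
    lo, hi = 0, len(checkins) - 1       # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(checkins[mid]):
            lo = mid
        else:
            hi = mid
    return checkins[hi]
```

With a test that only runs every N checkins, a failure means at most about log2(N) extra runs to pin down the culprit.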

What does a high-risk test look like? I don't really know what I was thinking. Now it sounds like this low-risk/high-risk business is perhaps referring to the sensitivity of the test -- what has to change for it to fail. If the test never asserts anything, then it's probably low risk: some big change needs to happen in the execution flow itself to get it to fail. But if it is asserting a particular value, or even a particular sequence of function calls, that is higher risk, because the same execution flow might be updated to produce or use different values than the test expects.

* What's the business value? (Regression tests, "documentation", less debug time -- tests give info, velocity)

Again more cost-benefit thoughts. Writing tests takes time and test automation takes resources, so there needs to be justification for such activities, since that time and those resources could be used elsewhere. The documentation is in quotes because usually tests are a poor man's doc. Just a step above reading the code directly, but sometimes not even that. Where I find them most useful is in showing protocols not evident from the code itself or from the actual documentation. E.g. recently I played with a sound-playing library whose documentation consisted of "javadoc"-style function docs (it wasn't Java). The key function was (play bytes) -- ok, so I need to pass it sound bytes instead of a filename. In the end I had to find a test case in a different project entirely that happened to use this library together with another library that actually read in an MP3 file. GitHub and Gist all-site code search is pretty nifty...

Good tests, when they fail, point to a specific and limited cause. Bad tests fail for all sorts of unrelated reasons. This is how good tests can reduce debug time. If someone added a value to a list that a test checks, and the test fails, it probably says something like it expected the list to be x,y,z but it is now x,y,z,w. A quick look at the test code to see what system code it's calling will quickly find the culprit. Then we decide if the change is in error, or if we just need to update the test. By the way, if the answer is always "update the test" then this test is probably not very valuable; consider removing it or making it lower risk (less sensitive to breaking, e.g. it just checks that the list is non-empty). If it's explicitly lower risk, perhaps it's also explicitly lower value, but non-zero?
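
To make the sensitivity difference concrete, here is a tiny Python sketch. The supported_formats function is hypothetical and stands in for whatever system code produces the list.

```python
def supported_formats():
    """Hypothetical system code returning a list a test cares about."""
    return ["x", "y", "z"]


def test_formats_exact():
    # High sensitivity: fails if anyone adds, removes or reorders an entry.
    assert supported_formats() == ["x", "y", "z"]


def test_formats_non_empty():
    # Low sensitivity: only fails if the list comes back empty.
    assert supported_formats()
```

Adding "w" to the list breaks the first test and leaves the second untouched. Which of the two is worth having depends on whether a changed list is usually a mistake or usually intended.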

A good test suite can make you develop new code faster and make changes to old code faster. A bad test suite can slow you down terribly. If your tests fall into the latter category, you're probably not getting as much business value out of them as you could, and the correct choice might very well be to just delete them all. Remember, one can write software correctly without testing it -- you do it for your test software after all.

* Unit tests are fantastic when there's a requirements oracle

I don't really think this now. And I think then it was perhaps more a dig against test driven development than unit tests themselves. You don't build a mountain road by starting to put up random guardrails and crashing things into them, then reinforcing them, until you somehow get a path to a destination you aren't even sure of... If you know the intended road and the destination though, you can get away with putting up guardrails first, even if it's awkward.

But I still don't like TDD. Even when the requirements are crystal clear, like writing a Sudoku solver, it fails. (Jeffries vs Norvig.)

Really, unit tests work well both when the goals are clear and when they are unclear. The main audience for unit tests is the devs themselves. Yes, it's annoying when code changes from what it was last week and you have to update tests you just wrote yesterday to account for it. But since unit tests are by definition typically small, the amount of work should be small and you may be able to use automated tools to do most of it (e.g. method renaming).

* Prefer functional/system level tests (behavior under happy case, under adversary case, security, exploratory)

I don't quite agree with this. Now, my own definition of unit test is "a test that is very fast", like under 300ms, and this tends to put constraints on other things like how much it's testing and the size of the thing it's testing. Though note that sometimes Java tests take like 4+ seconds just to warm up, loading all the classes that get transitively pulled in through static dependencies, when the test itself takes milliseconds. It's still possible for such a test to be a 'unit test'.

Anyway, I'm not very interested in trying to get into finer grained test categorization. But here I would call out that "system tests", i.e. tests that require the full system to be running before you can execute them (database, webservice, cache service, log service, etc.), are the worst tests and should be avoided. This matches the usual "test pyramid" idea. Yes, it's nice to have such full system and end-to-end tests to verify complete functionality of the whole, but if your parts are well tested and your integration points are well tested (some people call these integration tests), and both can be done without the full system running, then you typically don't get much added value from the full system tests.

Unit tests can test happy path/edge cases/security vectors just as well as integration tests (which may run in the same 'unit test' framework).

Unit tests can be easier to extend and accumulate to facilitate exploration. "What happens if this value is used?" is easier and faster to answer with a test that doesn't require the full system up and running.

* Pure functions

Yup. Functional programming is great. And pure functions are easy to add tests for.

* Human tests at the REPL

Sucks for languages that don't have REPLs (though in JavaLand, until I can get Java 11 at work and kick the tires on its "repl" usefulness, I actually integrated Armed Bear Common Lisp so I can use a REPL to test certain things at will).

I wasn't as familiar with Lisp back then, so this mindset was much more limited than what I feel now. But it suggests a style of development that I liked for Python. Python is useful in that each file is an independent module that must be imported into other files, which means you can relatively easily write files that are very self-contained. And while you're writing them, you can have a Python REPL open on the side and test things out in the local context. Of course you can also load more and more (and perhaps all) of the application from the REPL and experiment cross-module, with real values. This is preferred in Lisp, where it's very well supported. And indeed in Lisp you're developing your whole application from the REPL, rather than using the REPL as supplementary (as tends to be the case in Python etc).

Anyway, your REPL tests might be worthy of becoming real tests, or they might not. I find a lot of value in using the REPL just to double-check myself -- did I write that regex correctly? Let's verify with a few sample values. Maybe one of them gets crystallized into a low-value automated test, but the fact remains that I tested it at some point before committing it. In languages without REPLs, like Java 8, I think a poor man's attempt at this interactive feedback is at the root of TDD. You can only execute code by writing and compiling new code, and the easiest way to do that in these languages is with a unit test, and since you've now written the test you're incentivized to keep it around when you're done playing. But just as most REPL sessions aren't worth keeping (some are though -- doctests are still a nice concept), probably most unit tests written initially as exploratory probes or whatever are also not worth keeping.
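
For the regex case, a doctest is roughly what "keeping the REPL session" looks like in Python. This is only a sketch: the pattern and the looks_like_date function are invented for illustration.

```python
import re

# Hypothetical pattern: an ISO-style date such as 2020-06-19.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def looks_like_date(text: str) -> bool:
    """Return True if text has the shape YYYY-MM-DD.

    The examples below are the sort of checks one would type at the REPL:

    >>> looks_like_date("2020-06-19")
    True
    >>> looks_like_date("2020-6-19")
    False
    >>> looks_like_date("not a date")
    False
    """
    return DATE_RE.match(text) is not None
```

Saved to a file, the examples in the docstring can be re-run with `python -m doctest thefile.py`, which prints nothing when they all still hold.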

* Tests don't improve quality, devs do. Tests can help devs improve quality.

Tests aren't magical quality-dust; they're only one of many tools devs can take advantage of (or not) to improve code quality.

* Goals, schedules, budgets -- testing time must be taken into account. Risk analysis in terms of premature release. Generally push back and try to have quality, look for signs along the way. Ultimately a broken product making money is better than no product seeking perfection. Linux vs Hurd.

More thoughts on cost-benefit, business values. In planning, my team tries to take into account the quality-engineering time/complexity of a story for its pointing. This is good. If you're not doing testing, or doing testing late, consider the risks. (Noteworthy that the Linux kernel doesn't have tests, unless the third-party ones that developed have been integrated by now, in which case it didn't have tests for many years while still shipping and being pretty high quality -- far fewer kernel panics than BSODs for instance).

* Tests (and ..unreadable) are the difference between a program and a programming product

This is getting at the amateur/professional distinction. Tests are an indication that you're somewhat serious. Nothing wrong with one-off and maybe even hacky programs that will never see a formal test, but these days to create software that "sells" (even free open source software expected to be used by others for serious purpose) tests of some sort will be needed. Even if they're manual -- Linus would never release a kernel version that failed to boot on his hardware, right?

* Code coverage is great. But not enough. Unless testing all values of (CPU PC, System state) then it will be possible for some weird state while at a particular point in the code to kill you.

Since it's combinatorially impossible to test all those pairs, we settle for a lesser coverage metric, loosely corresponding to line/branch coverage. SQLite's philosophy is quite good here, though to me it is more inspirational, and in practice most of us have to work with less awesome everything. Still, it's obvious that even if you have "100% coverage" of a division method, passing in a 0 will still blow things up. The first chapter of Polya's book "How to Solve It" has some related lessons. Just because you've solved the problem, or made a test that covers a function, doesn't necessarily mean you're done! There are more things you can ask/verify and it may be very useful to do so.
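
The division case is easy to show in Python. The average function here is a made-up one-liner: a single test covers every line and branch of it, and it still has an input that kills it.

```python
def average(total: float, count: int) -> float:
    """Hypothetical one-line function: any call at all covers it fully."""
    return total / count


def test_average():
    # This single test yields 100% line and branch coverage of average()...
    assert average(10.0, 4) == 2.5


def test_average_of_nothing():
    # ...yet this input, which coverage never asked us to try, still raises.
    try:
        average(0.0, 0)
    except ZeroDivisionError:
        return
    raise AssertionError("expected ZeroDivisionError")
```

The coverage report looks the same with or without the second test; only the second one says anything about the interesting state.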

At the same time, in many projects it really isn't economical to truly cover every goddamn line. You can be satisfied with a less than 100% line coverage rate. (Even 0%, i.e. no tests, can be fine, depending.) Be skeptical of people pushing coverage numbers, since there's typically a political reason. Be skeptical when you're writing tests for the purpose of increasing your coverage numbers, and not to help you improve the quality of the code. (You may still write and commit them, but be aware of their real purpose...) Break down your coverage numbers by test type if you can, i.e. at least tests that require the full system running vs tests that don't.

One interesting thing from the extreme programming guys is that they claimed to need 100% (or at least very high) unit test coverage in order to successfully practice egoless/silo-less/shared ownership programming where code isn't owned and guarded by individuals or teams and anyone can feel free to make changes to any part as needed. I think this claim is overlooked by people who want to 'break down silos' or who think code ownership can be transferred so easily with a couple meetings. Having tests facilitates total newcomers coming in and doing things.

Despite the combinatorial explosion of testing when you consider state, there's a technique called all-pairs testing that can help you deal with it. I didn't know about it at all back then, and I'm still only aware of it (I haven't actually formally used it).
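
For a feel of what all-pairs testing does, here is a naive greedy sketch in Python. It is an illustration under assumptions, not how real pairwise tools work: it brute-forces the full product, so it only suits small inputs, and the result is not guaranteed to be minimal. The function names are my own.

```python
from itertools import combinations, product


def pairs_of(case):
    """All ((param_index, value), (param_index, value)) pairs in one test case."""
    return set(combinations(enumerate(case), 2))


def all_pairs(parameters):
    """Greedily pick cases until every value pair across two parameters is covered."""
    candidates = list(product(*parameters))
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical configuration space: 2 * 3 * 2 = 12 full combinations.
    params = [["chrome", "firefox"], ["linux", "mac", "windows"], ["en", "de"]]
    for case in all_pairs(params):
        print(case)
```

Every pair of values (say firefox with windows, or mac with de) shows up in at least one chosen case, while the number of cases stays below the full product.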

* Mocks are helpful but can give false confidence

Yes. Though to be more precise, I mean test doubles, of which mocks are just one type. Too many test doubles are a code smell for the test -- a sign the test is written with lower quality than it could be (and by extension the code under test).

It's embarrassing to stumble upon a test suite only to discover it's just testing its own mocks, not the actual code or integration points (which sometimes need to be tested with real things, e.g. a real Postgres and not an in-memory DB stub). Test coverage metrics can help spot this.
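
A compact example of a test that only tests its own mock, next to one that doesn't, using Python's standard unittest.mock. The total_owed function and the repo object are hypothetical.

```python
from unittest.mock import Mock


def total_owed(repo, customer_id):
    """Hypothetical code under test: sums invoice amounts from a repository."""
    return sum(inv["amount"] for inv in repo.invoices_for(customer_id))


def test_that_only_tests_the_mock():
    repo = Mock()
    repo.invoices_for.return_value = [{"amount": 5}]
    # total_owed() is never called: this asserts that the mock returns what
    # it was just told to return, and passes even if total_owed() is broken.
    assert repo.invoices_for(42) == [{"amount": 5}]


def test_that_exercises_the_code():
    repo = Mock()
    repo.invoices_for.return_value = [{"amount": 5}, {"amount": 7}]
    assert total_owed(repo, customer_id=42) == 12
```

A coverage report gives the first test away, since it contributes no coverage of total_owed at all. Neither test says anything about whether a real database would behave like the mock.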

PowerMock deserves another negative callout. It is a truly awesome tool, but it should not be the first tool one reaches for in Java tests; it should be a tool of last resort.


And that's it for my notes from then. One thing I didn't know about at the time was property testing; I like to encourage that these days. It's a middle ground before you reach true formal methods, which I also think have tremendous potential and are underutilized in the industry. The lessons from the Working Effectively with Legacy Code and Mikado Method books are also useful.
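
For anyone who hasn't seen property testing, here is a hand-rolled sketch of the idea in Python using only the standard library. Real property-testing libraries (Hypothesis for Python, QuickCheck for Haskell) add smarter input generation and shrinking of failing cases, which this sketch does not attempt. The run-length functions are made-up examples.

```python
import random


def run_length_encode(text):
    """Hypothetical function under test: 'aaab' -> [('a', 3), ('b', 1)]."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def run_length_decode(runs):
    return "".join(ch * count for ch, count in runs)


def test_decode_inverts_encode(trials=500, seed=0):
    rng = random.Random(seed)           # fixed seed keeps failures reproducible
    for _ in range(trials):
        # Small alphabet so that repeated characters actually occur.
        text = "".join(rng.choice("ab") for _ in range(rng.randint(0, 20)))
        # Property: decoding an encoding gives back the original string.
        assert run_length_decode(run_length_encode(text)) == text
```

Instead of asserting one hand-picked value, the test states something that should hold for every input and lets the generator go looking for a counterexample.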

Posted on 2020-06-19 by Jach

Tags: programming, testing

