This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Ben Yu

The following test mocks out a service call to CloudService. Does the test provide enough confidence that the service call is likely to work?

@Test public void uploadFileToCloudStorage() {
  when(mockCloudService.write(
          WriteRequest.newBuilder().setUserId("testuser").setFileType("plain/text")...))
      .thenReturn(WriteResponse.newBuilder().setUploadId("uploadId").build());

  CloudUploader cloudUploader = new CloudUploader(mockCloudService);

  Uri uri = cloudUploader.uploadFile(new File("/path/to/foo.txt"));
  // The uploaded file URI contains the user ID, file type, and upload ID. (Or does it?)
  assertThat(uri).isEqualTo(new Uri("/testuser/text/uploadId.txt"));
}

Lots of things can go wrong, especially when service contracts get complex. For example, plain/text may not be a valid file type, and you can’t verify that the URI of the uploaded file is correct.

If the code under test relies on the contract of a service, prefer exercising the service call instead of mocking it out. This gives you more confidence that you are using the service correctly:
@Test public void uploadFileToCloudStorage() {
  CloudUploader cloudUploader = new CloudUploader(cloudService);
  Uri uri = cloudUploader.uploadFile(new File("/path/to/foo.txt"));
  assertThat(cloudService.retrieveFile(uri)).isEqualTo(readContent("/path/to/foo.txt"));
}

How can you exercise the service call?

  1. Use a fake.  A fake is a fast and lightweight implementation of the service that behaves just like the real implementation (see the sketch after this list). A fake is usually maintained by the service owners; don’t create your own fake unless you can ensure its behavior will stay in sync with the real implementation.  Learn more about fakes at testing.googleblog.com/2013/06/testing-on-toilet-fake-your-way-to.html.
  2. Use a hermetic server.  This is a real server that is brought up by the test and runs on the same machine that the test is running on. A downside of using a hermetic server is that starting it up and interacting with it can slow down tests.  Learn more about hermetic servers at testing.googleblog.com/2012/10/hermetic-servers.html.
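As an illustration, here is a minimal sketch of the fake pattern. The interface below is a simplified stand-in for the hypothetical CloudService used in the examples above, not a real API:

import java.util.HashMap;
import java.util.Map;

interface CloudService {
  String write(String path, byte[] content);  // Returns the URI of the stored file.
  byte[] retrieveFile(String uri);
}

// A fake: fast and in-memory, but still enforcing the service's contract.
class FakeCloudService implements CloudService {
  private final Map<String, byte[]> files = new HashMap<>();

  @Override
  public String write(String path, byte[] content) {
    // Validate requests the same way the real service would, so tests that use
    // the fake still exercise the contract (e.g. an empty path is rejected).
    if (path == null || path.isEmpty()) {
      throw new IllegalArgumentException("path must be non-empty");
    }
    String uri = "/fake" + path;
    files.put(uri, content);
    return uri;
  }

  @Override
  public byte[] retrieveFile(String uri) {
    return files.get(uri);
  }
}

Because the fake keeps all state in memory, tests that use it stay fast while still catching contract violations that a mock would silently accept.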
If the service you are using doesn’t have a fake or hermetic server, mocks may be the only tool at your disposal. But if your tests are not exercising the service call contract, you must take extra care to ensure the service call works, such as by having a comprehensive suite of end-to-end tests or resorting to manual QA (which can be inefficient and hard to scale).

By Peter Spragins
with input from John Roane, Collin Johnston, Rose Rodrigues and Dave Chen


A Brief History of Efficacy



Originally named "Test Efficacy", a small team was formed in 2014 to quantify the value of individual tests to the development process. Some tests were particularly valuable because they provided a reliable breakage signal for critical code. Some tests were not useful because they were non-deterministic or they never failed. Confoundingly, tests would change in value over time as well. The team’s initial intention was to present this information to developers and help them optimize the development process.



To achieve the goal of informing developers about their tests, the team had to collect a huge amount of developer infrastructure/workflow data from a variety of sources across Google. Collecting all of this data in one place turned out to be incredibly valuable.



In addition to collecting and processing the data, the team developed a somewhat radical philosophy towards running tests at scale: the only important results come from tests which deterministically fail. Running an additional test that you know will pass is not a valuable signal to developers, and is likely a waste of resources.


Background on Google Presubmit


The process of committing code at Google has several testing stages. Perhaps the three most important testing stages are:
  1. Individual ad-hoc testing
  2. Presubmit
  3. Continuous build/continuous integration (hereafter referred to as continuous build).
Stages 1 and 2 can actually be interleaved in any order and repeated any number of times.

A presubmit executes all of the tests which are known to be affected by the edited code within one user's proposed code changes. The "affected tests" are calculated with the help of a "project definition", a configuration maintained by teams. A presubmit can run at any point during the change proposal process, but most importantly it must run before a user can permanently commit their changes.

Continuous build (3) is the continuous running of all tests within a project at the newest committed version of the code. Continuous build will execute tests even when they have already passed at presubmit.

The same test may run several times at presubmit during the development process: one last time at presubmit before a commit, and then finally once again at continuous build, after being merged into the main branch of Google's huge repository. For this reason, a "missed failure" at presubmit is not a critical failure. The test will still be run at continuous build, and the change will be rolled back if the test fails.

Efficacy Presubmit Service


Efficacy Presubmit Service is the fusion of "running the right tests at the right time" with one of the largest collections of test/developer data in the world. The service has one simple job: save time and resources by not running, or even compiling, tests that we are very confident will pass at presubmit. The ideal "Efficacy Presubmit" would predict ahead of time which tests will pass, and run only the tests that were going to fail. The user can then get feedback from the failing tests and fix their mistakes at the minimum possible cost in user and CPU time.

To make this idea possible, we made one significant abstraction of the actual presubmit testing process. A given presubmit may run zero tests, or many. In a presubmit with one test, if that test fails then the presubmit fails. In a presubmit with a thousand tests, a single failing test will still fail the presubmit. Efficacy Presubmit abstracts each of these test executions into an equivalent unit. This greatly simplifies creating a training dataset.
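As a minimal sketch, the flattening might look like the following; the field names are illustrative guesses, not the actual Efficacy schema:

import java.util.ArrayList;
import java.util.List;

class TrainingSetBuilder {
  // One labelled row per test execution; the features are illustrative.
  record Row(double recentFailureRate, double distanceToEdit, double runtimeSec,
             boolean failed) {}
  record TestResult(double recentFailureRate, double distanceToEdit,
                    double runtimeSec, boolean failed) {}
  record Presubmit(List<TestResult> results) {}

  static List<Row> buildTrainingSet(List<Presubmit> presubmits) {
    List<Row> rows = new ArrayList<>();
    for (Presubmit presubmit : presubmits) {
      // A presubmit with a thousand tests contributes a thousand rows; a
      // presubmit with one test contributes one row. Each row is treated as
      // an equivalent, independent unit.
      for (TestResult result : presubmit.results()) {
        rows.add(new Row(result.recentFailureRate(), result.distanceToEdit(),
            result.runtimeSec(), result.failed()));
      }
    }
    return rows;
  }
}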


Machine Learning / Probabilistic Safety



Quick background on ML

ML techniques and processes are quite well known throughout the industry at this point. The TensorFlow tutorials are a great introduction. The type of ML we use is classification. A classifier is essentially a mapping from the domain of the dataset to the range of the classes. MNIST is a very famous example of classification. An MNIST classifier maps from the domain of input images to the range of digits {0, 1, …, 9}.





In some other classification problems, the inputs are more "tabular". A famous example of tabular classification is the Iris species dataset. This is very similar to what Efficacy does.


Efficacy's Application of ML

Given the abstraction on the presubmit testing process described above, predicting the outcomes of automated testing at a large company is a perfect machine learning problem in many ways. You have:

  1. A very large labelled dataset: the set of test executions and their results
  2. Copious numerical feature columns with trustworthy values:
    1. Recent failure history of each test
    2. Various "distance" metrics from edited source files to tests, i.e. is this a test for the edited code?
    3. Test size and runtime data
  3. Several dimensions that can be aggregated
There are some aspects of the problem which make ML difficult as well:

  1. The classes are highly imbalanced with respect to labels (the vast majority of tests are going to pass, not fail)
  2. Flaky tests can mislead the model because their labels are "untrue"

We chose to reduce the problem to binary classification. The model chooses whether or not to run the test. In other words, failure is the positive class, and everything else is the negative class.
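As a minimal sketch, assuming a trained model that maps a feature row to a failure probability (the interface and names below are illustrative, not a real Efficacy API):

interface FailurePredictor {
  double failureProbability(double[] featureRow);
}

class PresubmitScheduler {
  private final FailurePredictor model;
  private final double threshold;  // Tuned for safety, as described below.

  PresubmitScheduler(FailurePredictor model, double threshold) {
    this.model = model;
    this.threshold = threshold;
  }

  // Run (and compile) a test only if the model is not confident it will pass.
  boolean shouldRun(double[] featureRow) {
    return model.failureProbability(featureRow) >= threshold;
  }
}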



We pick a threshold that results in an extremely low number of false negatives: failing tests which are not run because the model thinks they would have passed. This reduces the number of skipped tests (true negatives) in exchange for a very high margin of safety. In addition, tests will be run afterwards at continuous build anyway, making presubmit skipping very safe.
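One simple way to pick such a threshold from a validation set is sketched below: sweep candidate thresholds and keep the highest one that still catches the required fraction of failures. This is an illustrative simplification, not the production tuning process:

class ThresholdPicker {
  // scores[i] is the predicted probability that test i fails; failed[i] is the
  // actual outcome. Tests scoring below the returned threshold are skipped.
  static double pickThreshold(double[] scores, boolean[] failed, double minSensitivity) {
    double best = 0.0;
    for (double candidate = 0.0; candidate <= 1.0; candidate += 0.001) {
      int caught = 0;
      int totalFailures = 0;
      for (int i = 0; i < scores.length; i++) {
        if (failed[i]) {
          totalFailures++;
          if (scores[i] >= candidate) caught++;  // This failing test still runs.
        }
      }
      double sensitivity = totalFailures == 0 ? 1.0 : (double) caught / totalFailures;
      if (sensitivity >= minSensitivity) {
        best = candidate;  // Keep raising the bar while it remains safe.
      } else {
        break;  // Sensitivity only drops as the threshold rises.
      }
    }
    return best;
  }
}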



Difficulties of Scale

In addition to the problems that were natural to the "schema" of the dataset, we faced some problems due to the scale of Google's testing.


Many of these problems stem from the fact that Google works out of one large repository (paper, talk). Because of this, some presubmits have a very large number of tests, and some commits require a large number of presubmits before they are finished. This means that the service has to make predictions for a very large number of tests all at once. If a presubmit tried to run every test at Google, the service would have to predict each test individually: N rows times the number of feature columns. Loading the data to generate all of these feature values uses a lot of memory.


Another difficulty of doing this work at scale is that even when false negatives are very rare as a percentage, at Google's volume of testing they still happen somewhat frequently in absolute terms. This requires our team to be open to communication with any customer team. In some cases we may have to tell them they were the victim of a very low probability event. In other cases we may find a bug, or room for improvement.



Results


The two key numbers for the system's performance are sensitivity, the percentage of failing tests we actually execute, and specificity, the percentage of passing tests we actually skip. The two numbers go hand in hand: for a given model, requiring a higher sensitivity will result in a lower specificity, and vice versa. We can easily tune the percentage of tests skipped, which changes the fidelity of the testing signal the developers receive.

When the system is wrong, it can have some negative impact on developers if the prediction is a false negative. Rarely, it will allow a developer to commit code that will break a test at continuous build. This results in a broken "project", which takes some time to detect, and then a roll-back of the code. This costs developer time and requires a flexible mentality towards testing. To achieve a positive balance, we must deliver millions of skipped tests for every negative developer experience. The sensitivity of our system is very high, and our specificity is around 25%.
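For concreteness, here are the two metrics in code, computed from standard confusion-matrix counts ("positive" means the test fails, matching the convention above):

class EfficacyMetrics {
  // Sensitivity: the fraction of failing tests that were actually executed.
  static double sensitivity(int truePositives, int falseNegatives) {
    return (double) truePositives / (truePositives + falseNegatives);
  }

  // Specificity: the fraction of passing tests that were skipped.
  static double specificity(int trueNegatives, int falsePositives) {
    return (double) trueNegatives / (trueNegatives + falsePositives);
  }
}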



This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Marek Kiszkis


We all try to avoid errors in our code. But what about errors created by callers of your code? Good interface design can make it easy for callers to do the right thing, and hard for callers to do the wrong thing. Don't push the responsibility of maintaining invariants required by your class onto its callers.
Can you see the issues that can arise with this code?
class Vector {
 public:
  explicit Vector(int num_slots);  // Creates an empty vector with `num_slots` slots.
  int RemainingSlots() const;  // Returns the number of currently remaining slots.
  void AddSlots(int num_slots);  // Adds `num_slots` more slots to the vector.
  // Adds a new element at the end of the vector. Caller must ensure that
  // RemainingSlots() returns at least 1 before calling this; otherwise the
  // caller should call AddSlots() first.
  void Insert(int value);
};

If the caller forgets to call AddSlots(), undefined behavior might be triggered when Insert() is called. The interface pushes complexity onto the caller, exposing the caller to implementation details.

Since maintaining the slots is not relevant to the caller-visible behaviors of the class, don't expose them in the interface; make it impossible to trigger undefined behavior by adding slots as needed in Insert().
class Vector {
 public:
  explicit Vector(int num_slots);
  // Adds a new element at the end of the vector. If necessary,
  // allocates new slots to ensure that there is enough storage
  // for the new value.
  void Insert(int value);
};


Contracts enforced by the compiler are usually better than contracts enforced by runtime checks, or worse, documentation-only contracts that rely on callers to do the right thing.
Here are other examples that could signal that an interface is easy to misuse:
  • Requiring callers to call an initialization function (alternative: expose factory methods that return your object fully initialized).
  • Requiring callers to perform custom cleanup (alternative: use language-specific constructs that ensure automated cleanup when your object goes out of scope).
  • Allowing code paths that create objects without required parameters (e.g. a user without an ID).
  • Allowing parameters for which only some values are valid, especially if it is possible to use a more appropriate type (e.g. prefer Duration timeout instead of int timeout_in_millis; see the sketch after this list).
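To illustrate two of these alternatives, here is a minimal Java sketch with hypothetical names: a factory method that returns a fully initialized object, and a Duration parameter instead of a raw integer of milliseconds:

import java.time.Duration;

class HttpFetcher {
  private final Duration timeout;

  private HttpFetcher(Duration timeout) {
    this.timeout = timeout;
  }

  // Callers cannot obtain a partially initialized fetcher, and cannot confuse
  // milliseconds with seconds: the factory method and the Duration type
  // enforce both contracts, instead of documentation.
  static HttpFetcher createWithTimeout(Duration timeout) {
    if (timeout.isNegative() || timeout.isZero()) {
      throw new IllegalArgumentException("timeout must be positive");
    }
    return new HttpFetcher(timeout);
  }
}

A caller writes HttpFetcher.createWithTimeout(Duration.ofSeconds(5)); a mistake such as passing 5 where 5000 milliseconds was intended is now impossible to express.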
It is not always practical to have a foolproof interface. In certain cases, relying on static analysis or documentation is necessary since some requirements are impossible to express in an interface (e.g. that a callback function needs to be thread-safe).

Don’t enforce what you don’t need to enforce; avoid code that is too defensive. For example, extensive validation of function parameters can increase complexity and reduce performance.

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Dillon Bly

What makes this test fragile?
@Test public void displayGreeting_showSpecialGreetingOnNewYearsDay() {
  fakeClock.setTime(NEW_YEARS_DAY);
  fakeUser.setName("Fake User”);
  userGreeter.displayGreeting();
  // The test will fail if userGreeter.displayGreeting() didn’t call  
  // mockUserPrompter.updatePrompt() with these exact arguments.
  verify(mockUserPrompter).updatePrompt(
      "Hi Fake User! Happy New Year!", TitleBar.of("2018-01-01"), PromptStyle.NORMAL);
}

The test specifies exact values for all arguments to mockUserPrompter. These arguments may need to be updated when the code under test is changed, even if the changes are unrelated to the behavior being tested. For example, if additional text is added to TitleBar, every test in the codebase that specifies this argument will need to be updated.

In addition, verifying too many arguments makes it difficult to understand what behavior is being tested since it’s not obvious which arguments are important to the test and which are irrelevant.

Instead, only verify arguments that affect the correctness of the specific behavior being tested. You can use argument matchers (e.g., any() and contains() in Mockito) to ignore arguments that don't need to be verified:
@Test public void displayGreeting_showSpecialGreetingOnNewYearsDay() {
  fakeClock.setTime(NEW_YEARS_DAY);
  userGreeter.displayGreeting();
  verify(mockUserPrompter).updatePrompt(contains("Happy New Year!"), any(), any());
}

Arguments ignored in one test can be verified in other tests. Following this pattern allows us to verify only one behavior per test, which makes tests more readable and more resilient to change. For example, here is a separate test that we might write:
@Test public void displayGreeting_renderUserName() {
  fakeUser.setName("Fake User");
  userGreeter.displayGreeting();
  // Focus on the argument relevant to showing the user's name.
  verify(mockUserPrompter).updatePrompt(contains("Hi Fake User!"), any(), any());
}

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Ben Yu

What scenario does the following code test?
TEST_F(BankAccountTest, WithdrawFromAccount) {
  Transaction transaction = account_.Deposit(Usd(5));
  clock_.AdvanceTime(MIN_TIME_TO_SETTLE);
  account_.Settle(transaction);

  EXPECT_THAT(account_.Withdraw(Usd(5)), IsOk());
  EXPECT_THAT(account_.Withdraw(Usd(1)), IsRejected());
  account_.SetOverdraftLimit(Usd(1));
  EXPECT_THAT(account_.Withdraw(Usd(1)), IsOk());
}
Translated to English: “(1) I had $5 and was able to withdraw $5; (2) then got rejected when overdrawing $1; (3) but if I enable overdraft with a $1 limit, I can withdraw $1.” If that sounds a little hard to track, it is: it is testing three scenarios, not one.



A better approach is to exercise each scenario in its own test:
TEST_F(BankAccountTest, CanWithdrawWithinBalance) {
  DepositAndSettle(Usd(5));  // Common setup code is extracted into a helper method (shown below).
  EXPECT_THAT(account_.Withdraw(Usd(5)), IsOk());
}
TEST_F(BankAccountTest, CannotOverdraw) {
  DepositAndSettle(Usd(5));
  EXPECT_THAT(account_.Withdraw(Usd(6)), IsRejected());
}
TEST_F(BankAccountTest, CanOverdrawUpToOverdraftLimit) {
  DepositAndSettle(Usd(5));
  account_.SetOverdraftLimit(Usd(1));
  EXPECT_THAT(account_.Withdraw(Usd(6)), IsOk());
}
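For reference, a possible DepositAndSettle() helper in the test fixture, derived from the setup lines of the original test (the Money parameter type is an assumption):

// Member of BankAccountTest; mirrors the original test's setup.
void DepositAndSettle(Money amount) {
  Transaction transaction = account_.Deposit(amount);
  clock_.AdvanceTime(MIN_TIME_TO_SETTLE);
  account_.Settle(transaction);
}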

Writing tests this way provides many benefits:

  • Logic is easier to understand because there is less code to read in each test method.
  • Setup code in each test is simpler because it only needs to serve a single scenario.
  • Side effects of one scenario will not accidentally invalidate or mask a later scenario’s assumptions.
  • If a scenario in one test fails, other scenarios will still run since they are unaffected by the failure.
  • Test names clearly describe each scenario, which makes it easier to learn which scenarios exist.
One sign that you might be testing more than one scenario: after asserting the output of one call to the system under test, the test makes another call to the system under test.



While a scenario for a unit test often consists of a single call to the system under test, its scope can be larger for integration and end-to-end tests. For example, a test that a web UI can send email might open the inbox, click the compose button, write some text, and press the send button.




This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Max Kanat-Alexander


It's easy to assume that a developer who sends you some code for review is smarter than you'll ever be, and that's why you don't understand their code.

But in reality, if code is hard to understand, it's probably too complex. If you're familiar with the programming language being used, reading healthy code should be almost as easy as reading a book in your native language.

Pretend a developer sends you this block of Python to be reviewed:
def IsOkay(n):
  f = False
  for i in range(2, n):
    if n % i == 0:
      f = True
  return not f

Don't spend more than a few seconds trying to understand it. Simply add a code review comment saying, "It's hard for me to understand this piece of code," or be more specific, and say, "Please use more descriptive names here."

The developer then clarifies the code and sends it to you for review again:
def IsPrime(n):
  for divisor in range(2, n // 2):
    if n % divisor == 0:
      return False

  return True

Now we can read it pretty easily, which is a benefit in itself.

Often, just asking a developer to clarify a piece of code will result in fundamental improvements. In this case, the developer noticed possible performance improvements since the code was easier to read—the function now returns earlier when the number isn't prime, and the loop only goes to n/2 instead of n.

However, now that we can easily understand this code, we can see many problems with it. For example, it has strange behavior with 0 and 1, and there are other problems, too. But most importantly, it is now apparent that this entire function should be removed and be replaced with a preexisting function for detecting if a number is prime. Clarifying the code helped both the developer and reviewer.

In summary, don't waste time reviewing code that is hard to understand; just ask for it to be clarified. In fact, such review comments are one of the most useful and important tools a code reviewer has!

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Ben Yu

Helper methods make it easier to create test data. But they can become difficult to read over time as you need more variations of the test data to satisfy constantly evolving requirements from new tests:
// This helper method starts with just a single parameter:
Company company = newCompany(PUBLIC);

// But soon it acquires more and more parameters.
// Conditionals creep into the newCompany() method body to handle the nulls,
// and the method calls become hard to read due to the long parameter lists:
Company small = newCompany(2, 2, null, PUBLIC);
Company privatelyOwned = newCompany(null, null, null, PRIVATE);
Company bankrupt = newCompany(null, null, PAST_DATE, PUBLIC);

// Or a new method is added each time a test needs a different combination of fields:
Company small = newCompanyWithEmployeesAndBoardMembers(2, 2, PUBLIC);
Company privatelyOwned = newCompanyWithType(PRIVATE);
Company bankrupt = newCompanyWithBankruptcyDate(PAST_DATE, PUBLIC);

Instead, use the test data builder pattern: create a helper method that returns a partially-built object (e.g., a Builder in languages such as Java, or a mutable object) whose state can be overridden in tests. The helper method initializes logically-required fields to reasonable defaults, so each test can specify only fields relevant to the case being tested:
Company small = newCompany().setEmployees(2).setBoardMembers(2).build();
Company privatelyOwned = newCompany().setType(PRIVATE).build();
Company bankrupt = newCompany().setBankruptcyDate(PAST_DATE).build();
Company arbitraryCompany = newCompany().build();

// Zero parameters makes this method reusable for different variations of Company.
// It also doesn’t need conditionals to ignore parameters that aren’t set (e.g. null
// values) since a test can simply not set a field if it doesn’t care about it.
private static Company.Builder newCompany() {
  return Company.newBuilder().setType(PUBLIC).setEmployees(100); // Set required fields
}

Also note that tests should never rely on default values that are specified by a helper method since that forces readers to read the helper method’s implementation details in order to understand the test.
// This test needs a public company, so explicitly set it.
// It also needs a company with no board members, so explicitly clear it.
Company publicNoBoardMembers = newCompany().setType(PUBLIC).clearBoardMembers().build();

You can learn more about this topic at http://www.natpryce.com/articles/000714.html