In my profession, I meet lots of customers who are starting or are in the middle of their digital transformation, and many of them are trying to solve the puzzle around Continuous Delivery/Deployment. For the past 5 years, the discussion around software development has been focused on speed and quality. Most people say that automation is key, and so do I – automation makes tasks execute faster, more reliably, and more repeatably.
One of the many automations that organizations are struggling with, though, is test automation, and the reason for this is test data. Shared test environments mean sharing and overwriting test data. Once a test execution has run, the data is ‘dirty’ and can’t be reused for subsequent tests without a “refresh.”
In less than a year, we also need to look at the new regulatory requirements on test data. There are millions of articles about the implications of GDPR, but I will stick to what is required from a testing standpoint.
Sensitive vs Insensitive
Names, addresses, credit card numbers, account numbers, phone numbers, account names, etc. are all obviously sensitive. The real challenge is to find the point at which insensitive data becomes sensitive. For example, below I have anonymized the author, the account, and the image, but most of you will be able to re-identify this record, due to “known behavior.”
In other words, a combination of known behavior and free text with different individually insensitive pieces of data can become so rare that it’s no longer insensitive. That is, insensitive data can become sensitive if the depth of a typical data set is too shallow.
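One rough way to spot such combinations is to count how often each combination of individually insensitive fields occurs in a data set, and to flag the ones that occur fewer than k times (the idea behind k-anonymity). The sketch below is illustrative only: the field names, the records, and the threshold are all made up.

```python
from collections import Counter

def rare_combinations(records, fields, k=2):
    """Return the field-value combinations that occur fewer than k times."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical records: no single field is sensitive on its own.
records = [
    {"city": "Stockholm", "job": "tester", "posts_at": "night"},
    {"city": "Stockholm", "job": "tester", "posts_at": "night"},
    {"city": "Stockholm", "job": "tester", "posts_at": "day"},
    {"city": "Kiruna", "job": "astronaut", "posts_at": "night"},
]

risky = rare_combinations(records, ["city", "job", "posts_at"], k=2)
for combo, n in risky.items():
    print(combo, "occurs", n, "time(s) and may be re-identifiable")
```

A combination that shows up only once is a candidate for further masking or for generalizing one of its fields, even though each value looks harmless in isolation.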
You must ask yourself these questions:
- What data is sensitive?
- Is there behavior or specific content, or are there pieces of insensitive data, that become sensitive when combined?
Sometimes even content in free text can be sensitive.
There are two distinct ways of making sensitive data unidentifiable: anonymization and pseudonymization. Both mask data based on an encryption algorithm (key), but with anonymization you throw away the key, so the masking is not reversible.
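To make the difference concrete, here is a minimal sketch. It swaps in a keyed hash (HMAC) plus a lookup table rather than a real encryption scheme, and the class and method names are hypothetical. As long as the key and table are kept, the masking is reversible (pseudonymization); once they are thrown away, it is not (anonymization).

```python
import hashlib
import hmac
import secrets

class Pseudonymizer:
    """Replaces values with keyed tokens and remembers how to reverse them."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # secret key, kept while reversal is needed
        self._lookup = {}                    # token -> original value

    def mask(self, value: str) -> str:
        if self._key is None:
            raise RuntimeError("key has been discarded")
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = "user_" + digest[:12]        # shortened for readability; collisions are possible
        self._lookup[token] = value
        return token

    def unmask(self, token: str) -> str:
        return self._lookup[token]           # raises KeyError once the table is gone

    def forget(self) -> None:
        """Throw away the key and the table: existing tokens become anonymized."""
        self._key = None
        self._lookup.clear()

p = Pseudonymizer()
token = p.mask("alice@example.com")
print(token, "->", p.unmask(token))  # reversible while the key and table exist
p.forget()                           # from here on, the token cannot be reversed
```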
Once all this data plumbing is done, we need to synchronize it across all databases, files, and other data stores that are managed by the applications we are interacting with, internal or external. The challenge of doing that across all internal applications is hard enough, but what about 3rd parties? In the Twitter example, let’s assume that I’m testing a new feature where comments to a tweet would also show up as a comment to a connected Facebook account. Somehow, we need to synchronize our anonymized test data to fulfill the testing of that feature, too.
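One common way to keep masked data aligned across stores is to make the masking deterministic, so the same original value always produces the same token wherever it appears. The sketch below assumes every masking job can share one secret key; the store layouts and field names are invented for illustration.

```python
import hashlib
import hmac

SHARED_KEY = b"demo-key-not-for-real-use"  # assumption: one key shared by all masking jobs

def mask(value: str) -> str:
    """Deterministic token: the same input always yields the same output."""
    digest = hmac.new(SHARED_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest[:10]

# Two hypothetical stores that refer to the same person.
tweets = [{"author": "jane.doe", "text": "Hello"}]
facebook_comments = [{"account": "jane.doe", "comment": "Hello"}]

masked_tweets = [{**t, "author": mask(t["author"])} for t in tweets]
masked_comments = [{**c, "account": mask(c["account"])} for c in facebook_comments]

# The link between the two stores survives masking.
print(masked_tweets[0]["author"] == masked_comments[0]["account"])
```

This only helps where we control the masking on both sides. For a third party we cannot mask, the data usually has to be simulated instead.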
SSS - Synthesize, Simulate, and Stimulate
With anonymized (or pseudonymized) data, it is harder for us humans to recognize a test case by its data, so manual testing becomes more difficult. For Continuous Delivery initiatives, on the other hand, data anonymization works in your favor.
In order for test automation to work, we need proper, synchronized, and untouched test data in all systems that are accessed during the execution of a test case.
We need test data for 3 main purposes:
- To feed our test scripts – manual or automated
- To feed our SUT (system under test) – so that validation of test cases is synchronized with the data in our test scripts
- To feed simulations/virtual assets (stubs/mocks) with synchronized test data in data stores where we are not in control
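The three purposes above can be served from a single seed data set, so the test script, the SUT, and the stub always agree. The following is a minimal sketch under that assumption; `FakeSut`, `FacebookStub`, and the seed records are hypothetical stand-ins, not a real integration.

```python
# One hypothetical seed data set shared by all three consumers.
SEED_USERS = [
    {"id": 1, "handle": "user_a1", "linked_facebook": "fb_a1"},
    {"id": 2, "handle": "user_b2", "linked_facebook": None},
]

class FacebookStub:
    """Stub for a third-party data store we do not control."""
    def __init__(self, users):
        self.comments = {u["linked_facebook"]: [] for u in users if u["linked_facebook"]}

    def post_comment(self, account, text):
        self.comments[account].append(text)

class FakeSut:
    """Stand-in for the system under test, loaded from the same seed."""
    def __init__(self, users, facebook):
        self.users = {u["handle"]: dict(u) for u in users}
        self.facebook = facebook

    def comment(self, handle, text):
        target = self.users[handle]["linked_facebook"]
        if target is not None:
            self.facebook.post_comment(target, text)

def test_comment_is_mirrored():
    stub = FacebookStub(SEED_USERS)   # purpose 3: feed the simulation
    sut = FakeSut(SEED_USERS, stub)   # purpose 2: feed the SUT
    user = SEED_USERS[0]              # purpose 1: feed the test script
    sut.comment(user["handle"], "nice tweet")
    assert stub.comments[user["linked_facebook"]] == ["nice tweet"]

test_comment_is_mirrored()
```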
When we want to go one step further, from automated tests to continuous testing, we also need to provision this test data repeatedly, preferably from an automation script.
Proactive vs. Reactive
If we look past GDPR, I believe that organizations need to look more into the interface and data models. They tell us about valid and invalid structures and data through the “contract” or specification. We need to proactively test using those contracts and specifications instead of reactively testing against data. We need to think in terms of bug prevention where QA is not an event or a specific phase, but rather an ongoing process.
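To give a feel for what testing against a contract can look like, here is a minimal sketch. The contract format, the field names, and the limits are invented for this example; in practice the contract would come from the interface specification itself.

```python
# Hypothetical contract for a "comment" message:
# field name -> (type, required, maximum length or None)
COMMENT_CONTRACT = {
    "tweet_id": (int, True, None),
    "author": (str, True, 15),
    "text": (str, True, 280),
    "reply_to": (int, False, None),
}

def violations(message: dict, contract: dict) -> list:
    """Return a list of contract violations (empty if the message is valid)."""
    problems = []
    for field, (ftype, required, max_len) in contract.items():
        if field not in message:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        value = message[field]
        if not isinstance(value, ftype):
            problems.append(f"{field}: expected {ftype.__name__}")
        elif max_len is not None and len(value) > max_len:
            problems.append(f"{field}: longer than {max_len}")
    for field in message:
        if field not in contract:
            problems.append(f"unknown field: {field}")
    return problems

valid = {"tweet_id": 1, "author": "user_a1", "text": "hello"}
invalid = {"tweet_id": "1", "text": "x" * 281, "extra": True}

print(violations(valid, COMMENT_CONTRACT))
print(violations(invalid, COMMENT_CONTRACT))
```

Because both the valid and the invalid message are derived from the contract rather than copied from production, no sensitive data is needed to exercise the interface.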