Verify First
"Verification first development" finally puts a term to good data engineering practice without the rigidity of TDD.
I recently read an interesting take on my LinkedIn feed, and then it showed up on Hacker News. The argument was semantic - i.e., test driven development is a subset of test first development, which is a subset of verification first development.
There was a separate semantic point that held my attention:
…verification-first development (VFD), which is any means of determining the correctness of your software.
To me, this was new. I’ve not heard verification first development (VFD) used as a term of art. However, I think there is real merit in naming the ideas bundled together into VFD, especially in the spaces that I work day-to-day: data engineering & data science.
Why VFD
To me, the crux of the VFD definition above is simple: Start development only after you know how you will know if your solution is good enough. This has been the norm for the best data teams that I’ve worked in and with.
Those teams saw a few kinds of benefits from the upfront verification-definition effort:
- “Done” actually means done. When a data engineer announces a new table, stream, or other deliverable as complete, the end product is nearly always immediately usable. Users are delighted.
- Engineers have personal relationships with data users. In order to define a verification strategy for a feature, data engineers are forced to engage with data users. This establishes familiar relationships that make cross-functional problem-solving easy to activate.
- Development moves fast, and rework is uncommon. When verification is left to the end of development, there’s always going to be rework. Building against a test suite, or at least a defined approach to verification, moves the iteration into build time.
VFD in data engineering
In my (narrow) experience, there are a few practices that build up to create VFD in data engineering:
- Speak to users about quality up front. Ask every question needed for you to confidently articulate how you will verify your work. Walk away with a clear understanding of who is impacted, and how, by errors and escapes.
- Write tests and QA plans before transformations. To the degree your verification can be automated (e.g., great-expectations, dbt), write the tests first. If not all of them, write the ones you know matter most. If a verification can’t be automated, write down the procedure to be followed (e.g., a checklist in a PR description). There’s a sketch of this idea after this list.
- Release tiny increments of work. Not really a VFD thing, but avoid writing large diffs. Find the smallest feature and start there. Create many checkpoints in production that you’ve verified to be done and correct.
- Put each increment into a user’s hands. Get on a call. Walk to their desk. Have your user pull up the new work and show them the evidence of verification.
- Add the verifications you missed. There are always unanticipated rough edges. Always add automated tests when you find and fix issues. Use judgement on whether to keep expanding the manual procedures.
- Repeat. You’ve done great work, for which the reward is more work.
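To make the “tests before transformations” step concrete, here is a minimal, library-agnostic sketch in Python. The orders table, its columns, and the verify_orders function are invented for illustration; in practice, great-expectations or dbt tests would play this role. The point is the ordering: the checks exist before the transformation does, and the work is only done when they pass.

```python
import pandas as pd


def verify_orders(orders: pd.DataFrame) -> list[str]:
    """Return a list of verification failures for a hypothetical orders table.

    Written before the transformation exists, so development is "done"
    only when this list comes back empty.
    """
    failures = []
    if orders["order_id"].isna().any():
        failures.append("order_id contains nulls")
    if orders["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if (orders["amount"] < 0).any():
        failures.append("amount contains negative values")
    return failures


if __name__ == "__main__":
    # Run the checks against a candidate output of the transformation.
    candidate = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 3.5]})
    for failure in verify_orders(candidate):
        print("FAILED:", failure)
```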
Your tools matter
If writing data tests feels like a chore, I’d bet your tools are getting in your way. I’ve never worked with someone who doesn’t want to check their work. I have worked with great engineers who resisted writing tests because, with the tools they had, it genuinely was a waste of time.
Good tools make writing, running, and reporting on verification a breeze. Additionally, good tools put test and verification in-context with transformations - no need to switch screens or tools to work on tests.
Worth a look if you’re in the market:
- dbt Core and Cloud make tests a matter of SQL and config.
- Dagster offers a single UI for arbitrary tests (including from dbt); there’s a sketch of an asset check after this list.
- Databricks lets you write checks in transformation queries.
- Snowflake also lets you query with “data metric functions” to log check results.
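As an example of verification living in-context with transformations, here is a rough sketch of a Dagster asset check. The orders asset, its columns, and the check itself are invented for illustration, and the exact decorators may vary across Dagster versions; treat it as a shape, not a recipe.

```python
import pandas as pd
from dagster import AssetCheckResult, Definitions, asset, asset_check


@asset
def orders() -> pd.DataFrame:
    # Hypothetical transformation; real code would read and transform source data.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.0, 3.5]})


@asset_check(asset=orders)
def orders_order_id_is_unique(orders: pd.DataFrame) -> AssetCheckResult:
    # The check is defined next to the asset and reported alongside it in the UI.
    return AssetCheckResult(passed=bool(orders["order_id"].is_unique))


defs = Definitions(assets=[orders], asset_checks=[orders_order_id_is_unique])
```

The appeal of this layout is exactly the in-context point above: the check and the transformation ship in the same file and run in the same place.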
I’m sure there are less trendy products that are also good; I just haven’t come across them.
Happy verifying!
I’ve either been under a rock, or VFD as a term is a valuable recent innovation. Regardless, I’m happy to have a term for the process of some truly awesome data teams.