Starting up

I’m starting to work on some new problems and am putting some thoughts together here as a way to start building in public. The first problem is one I’ve run into while building many different types of products: what data should you use in a local or staging environment? Lorem ipsum comes to mind as a good analogy for what typically ends up in a staging environment. Data that is clearly fake and doesn’t look real undermines an effective testing strategy.

Problem:

As a data engineer, I would like to create data that looks like production data, but isn’t a copy of a production database, so that I can effectively test my data pipeline and/or application.

Why:

Sensitive data such as PII or PHI is difficult to sanitize fully, especially in unstructured text, where it is hard to be confident that every identifier has been removed.

Solution:

A tool that ingests a data schema and creates models to generate synthetic data that mimics production data.
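
To make the idea a bit more concrete, here is a minimal Python sketch of a schema-driven generator. Everything in it is an assumption made for illustration: the schema format, the column names, and the `generate_rows` function are hypothetical, and the hand-written per-column rules stand in for the models a real tool would build from production data.

```python
import random

# Hypothetical schema format: column name -> generation rule.
# This is an illustrative assumption, not the format of any existing tool.
SCHEMA = {
    "id": {"type": "sequence", "start": 1},
    "first_name": {"type": "choice", "values": ["Ana", "Ben", "Chloe", "Dev"]},
    "age": {"type": "int", "min": 18, "max": 90},
    "balance": {"type": "float", "min": 0.0, "max": 5000.0, "digits": 2},
}


def generate_rows(schema, count, seed=None):
    """Return `count` synthetic rows (dicts) that follow `schema`."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    rows = []
    for i in range(count):
        row = {}
        for column, spec in schema.items():
            kind = spec["type"]
            if kind == "sequence":
                # Monotonically increasing value, e.g. a primary key.
                row[column] = spec.get("start", 1) + i
            elif kind == "choice":
                row[column] = rng.choice(spec["values"])
            elif kind == "int":
                # randint is inclusive on both ends.
                row[column] = rng.randint(spec["min"], spec["max"])
            elif kind == "float":
                value = rng.uniform(spec["min"], spec["max"])
                row[column] = round(value, spec.get("digits", 2))
            else:
                raise ValueError(f"Unknown column type: {kind!r}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate_rows(SCHEMA, 3, seed=42):
        print(row)
```

Passing the same `seed` gives the same rows on every run, which helps keep tests deterministic. Replacing the uniform random rules with distributions learned from real data is the part that would make the output actually resemble production.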

Use cases:

seeding a local database
testing a data pipeline
performance / load testing a pipeline
mock data for QA / staging environments
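
As an example of the first use case, seeding a local database could look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database; the `users` table, its columns, the sample names, and the `seed_users` function are all hypothetical and only there to show the shape of the workflow.

```python
import random
import sqlite3

# Hypothetical sample values; a real tool would derive these from a model.
SAMPLE_NAMES = ["Ana", "Ben", "Chloe", "Dev"]


def seed_users(conn, count, seed=0):
    """Create a hypothetical `users` table and fill it with synthetic rows."""
    rng = random.Random(seed)  # fixed seed so every developer gets the same data
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER NOT NULL)"
    )
    rows = [(rng.choice(SAMPLE_NAMES), rng.randint(18, 90)) for _ in range(count)]
    # Parameterized inserts; SQLite assigns the id values automatically.
    conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # swap for a file path to persist locally
    inserted = seed_users(conn, 100)
    total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    print(f"inserted {inserted} rows, table now holds {total}")
    conn.close()
```

Raising `count` turns the same function into a crude starting point for the load-testing use case, although realistic load tests would also need realistic value distributions.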

Looking to crystallize this a bit more, but that’s the general gist. Hoping to have more soon. Stay tuned …

Craig Calderone Thoughts
