I’m starting to work on some new problems and am collecting my thoughts here as a way to build in public. The first problem is one I’ve run into while building many different types of products: what data should you use in a local or staging environment? Lorem ipsum is a good analogy for what typically ends up there. Data that is clearly fake and doesn’t look real undermines an effective testing strategy.
Problem:
- As a data engineer, I would like to create data that looks like production data, but isn’t a copy of a production database, so that I can effectively test my data pipeline and/or application.
Why:
- Sensitive data in the form of PII or PHI is difficult to fully sanitize, and it is hard to confidently remove all identifiers from unstructured text.
Solution:
- A tool that ingests a data schema and creates models to generate synthetic data that mimics production data.
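To make the idea concrete, here is a minimal sketch in Python of the "schema in, rows out" shape. Everything in it is an assumption for illustration: the schema format, the `SCHEMA` and `WORDS` values, and the `generate_value` / `generate_rows` helpers are hypothetical, and fixed random rules stand in for the models that a real tool would fit to production data.

```python
import random
from datetime import date, timedelta

# Hypothetical schema format (an assumption, not a real tool's API):
# column name -> (type, options).
SCHEMA = {
    "user_id": ("int", {"min": 1, "max": 10_000}),
    "first_name": ("choice", {"values": ["Ana", "Ben", "Chloe", "Dev", "Elena"]}),
    "signup_date": ("date", {"start": date(2020, 1, 1), "end": date(2023, 12, 31)}),
    "plan": ("choice", {"values": ["free", "pro", "enterprise"]}),
    "notes": ("text", {"max_words": 8}),
}

# Placeholder vocabulary for free-text columns (illustrative only).
WORDS = ["order", "shipped", "refund", "requested", "customer", "called", "updated", "address"]


def generate_value(kind, opts, rng):
    """Produce one synthetic value for a column of the given type."""
    if kind == "int":
        return rng.randint(opts["min"], opts["max"])
    if kind == "choice":
        return rng.choice(opts["values"])
    if kind == "date":
        span_days = (opts["end"] - opts["start"]).days
        return opts["start"] + timedelta(days=rng.randint(0, span_days))
    if kind == "text":
        word_count = rng.randint(1, opts["max_words"])
        return " ".join(rng.choice(WORDS) for _ in range(word_count))
    raise ValueError(f"unsupported column type: {kind}")


def generate_rows(schema, count, seed=0):
    """Generate `count` rows as dicts; the seed makes runs reproducible."""
    rng = random.Random(seed)
    return [
        {name: generate_value(kind, opts, rng) for name, (kind, opts) in schema.items()}
        for _ in range(count)
    ]


rows = generate_rows(SCHEMA, 3)
for row in rows:
    print(row)
```

Seeding the generator keeps test fixtures stable between runs, which matters when the same rows feed a pipeline test. The interesting part of the real problem is replacing these hand-written rules with something that learns realistic distributions without copying production records.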
Use cases:
- seeding a local database
- testing a data pipeline
- performance / load testing a pipeline
- mock data for QA / staging environments
Looking to crystalize this a bit more, but that’s the general gist. Hoping to have more soon. Stay tuned …
