One of the most challenging parts of a service-oriented or microservice architecture is connecting data across services.
There are various ways to design such an architecture, but without diving too deeply into theory, there are two very common patterns: a single database shared by all services, or a database per service, where each service owns its own schema.
The latter brings a lot of complexity, especially when you want to create end-to-end tests, spin up local environments, or create isolated test environments. The reason is that you usually have one service that acts as the entry point. For example, it might create a user, an account, or some basic entity, and then all other services rely on the entity IDs coming from that entry service.
Let’s say your gateway holds user information. Then you have a payment service that stores payments for users. That payment service needs to know the ID of the user. As you can imagine, this creates a challenge when you want to create sample data to seed the system.
This comes up all the time: onboarding a new hire and wanting to offer meaningful data so they can open the application locally and explore it, running end-to-end tests in CI, or generating millions of rows to stress test the system and validate scalability.
Being able to maintain seeders across multiple services quickly becomes a serious problem.
The core challenge: maintainability
Below I describe one approach that proved very helpful for our team. We spent quite some time investigating different solutions that ultimately didn’t fit what we were looking for.
The biggest concern, no matter how good your seeders are, is maintainability.
In an evolving system, new entities are introduced, existing entities change shape, new columns are added, new tables appear, and new relationships are created. If your seeders don’t evolve at the same pace, they silently drift out of date and eventually break in subtle ways.
This becomes even harder when you have many services owned by multiple teams. Without strong enforcement, seeders are often the first thing to fall behind.
So the real question for us was: how do we enforce maintainability of seeders across services?
High-level approach
The approach we chose was to build a seeding mechanism inside each service.
Each service is responsible for ingesting a YAML seed file that is generated externally by a dedicated seeder service. That seeder service generates YAML files based on different strategies.
For example, we have:
- a local development strategy
- an end-to-end testing strategy
- a scalability / stress testing strategy
Any developer can create their own strategy as well.
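To make the strategy idea concrete, here is a minimal Python sketch of what such strategies might look like. All class names are hypothetical (the source doesn't show the seeder service's code), and the real service uses Faker for sample data; the stdlib `random` module stands in here so the sketch is self-contained.

```python
import random

class SeedStrategy:
    """Hypothetical base strategy: builds the context fed to the YAML template."""
    n_teams = 2
    n_users = 5

    def build_context(self) -> dict:
        teams = [{"key": f"team_{i}", "name": f"Team {i}"} for i in range(self.n_teams)]
        users = [
            {
                "key": f"user_{i}",
                "name": f"User {i}",
                "email": f"user{i}@example.net",
                "password": "",
                # correlate each user with a team via its logical key
                "team": random.choice(teams)["key"],
            }
            for i in range(self.n_users)
        ]
        return {"teams": teams, "users": users}

class LocalDevStrategy(SeedStrategy):
    pass  # small, human-browsable dataset for local development

class ScalabilityStrategy(SeedStrategy):
    n_teams = 1_000
    n_users = 1_000_000  # stress-test volume

context = LocalDevStrategy().build_context()
```

The strategies differ only in volume here; in practice they could also vary data shapes, edge cases, or which entities get generated at all.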
Architecture overview
At a high level, a seeding strategy generates a YAML file based on a template. That YAML describes:
- entities
- sample data (often generated using Faker)
- correlations between entities across services
Seeder YAML template structure (Jinja2)
The seeder service uses Jinja2 templates to generate YAML files. Here’s an example template:
teams:
{% for team in teams %}
  - key: {{ team.key }}
    name: {{ team.name }}
{% endfor %}
users:
{% for user in users %}
  - key: {{ user.key }}
    name: {{ user.name }}
    email: {{ user.email }}
    password: {{ user.password }}
    team: {{ user.team }}
{% endfor %}
After we run a strategy, the resulting YAML seed file would look like this:
teams:
  - key: team_0
    name: Rodriguez, Figueroa and Sanchez
  - key: team_1
    name: Doyle Ltd
users:
  - key: user_0
    name: Monica Herrera
    email: smiller@example.net
    password: ''
    team: team_0
  - key: user_1
    name: Michele Williams
    email: kendragalloway@example.org
    password: ''
    team: team_0
Seeder service and modules
Each service is a multi-module Gradle project.
One of the key modules is the seeder module. This module parses the generated YAML and uses fixtures from the main service modules to insert data.
This part is very important.
Instead of duplicating insertion logic, the seeder module reuses the same repository functions and fixtures that are already used in tests.
This tight coupling is intentional.
Enforcing correctness through tests
This design gives us a very strong guarantee: seeders cannot silently go out of date.
For example, imagine we add a new column to the payments table, say a product_id that represents the product associated with a payment. To write tests for that change, we must update the fixtures that insert payments into the database.
As soon as we do that, the seeder will break if it hasn’t been updated to provide the new field. The build fails immediately.
That failure forces us to update:
- the seeding logic inside the service
- the YAML generation strategies that provide the data
By reusing repository functions from tests inside seeders, we created a form of soft enforcement. Any schema or entity change automatically forces us to update the seeders.
Shared fixture example
Below is an example of a repository function used in both tests and seeders. If a new column is added to the USERS table, this function must be updated, which immediately breaks any seeder that doesn't provide the new field.
fun createUser(
    teamId: Long,
    name: String = fkr.name.name(),
    email: String = fkr.internet.email(),
    password: String = fkr.random.randomString(12),
): UserEntity {
    val userId =
        dsl.insertInto(USERS)
            .set(USERS.TEAM_ID, teamId) // associate the user with their team
            .set(USERS.NAME, name)
            .set(USERS.EMAIL, email)
            .set(USERS.PASSWORD, password)
            .returningResult(USERS.ID)
            .fetchSingle(USERS.ID) ?: error("Failed to create user")
    return UserEntity(id = userId, name = name, email = email)
}
Running seeders locally
Locally, the setup is fairly simple.
When spinning up the Docker environment, a bash script calls the seeding executable of each service in the correct order and points it to the generated YAML files. These files live in the working directory of the seeding project.
The ordering here matters; services that create foundational entities (e.g. users, teams) must be seeded before services that reference them (e.g. payments). The bash script encodes this dependency order explicitly.
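The dependency ordering can be sketched in a few lines. The real script is bash with the order hand-encoded; this Python stand-in (service names and paths are illustrative) shows the same idea using the stdlib topological sorter, which would also catch accidental cycles:

```python
from graphlib import TopologicalSorter

# Hypothetical services and their seeding dependencies:
# "payments depends on gateway" means the gateway's users/teams
# must exist before payments referencing them are seeded.
deps = {
    "gateway": set(),
    "payments": {"gateway"},
    "notifications": {"gateway", "payments"},
}

# Resolve a valid seeding order, then build the per-service commands.
seed_order = list(TopologicalSorter(deps).static_order())
commands = [f"./{svc}/seed --file seeds/{svc}.yaml" for svc in seed_order]
```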
This gives developers a fully seeded local environment that mirrors real data flows across services.
Running seeders at scale
Things get more interesting when we want to do scalability testing.
For example, deploying a brand-new Kubernetes cluster and seeding it with large datasets across all services. In this case, we use init containers.
Each service runs its main container as usual, and alongside it we run an init container that:
- fetches the appropriate seed file (often from remote storage)
- executes the seeding process once, based on the chosen strategy (for example, scalability testing)
After that, we end up with a fully provisioned environment that is ready to run tests at scale.
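An init container of this kind might look roughly like the following pod-spec fragment. All names, images, and the fetch command are illustrative; the source doesn't show its actual manifests:

```yaml
# Hypothetical pod spec fragment for the payments service.
spec:
  initContainers:
    - name: seed-payments
      image: registry.example.com/payments-seeder:latest
      command: ["/bin/sh", "-c"]
      args:
        - |
          # fetch the seed file from remote storage, then run the seeder once
          fetch-seed --strategy scalability --out /seeds/payments.yaml
          /app/seeder --file /seeds/payments.yaml
      volumeMounts:
        - name: seeds
          mountPath: /seeds
  containers:
    - name: payments
      image: registry.example.com/payments:latest
```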
Connecting entities across services
One remaining question is: how do we connect entities across services?
In many databases, IDs are auto-incremented, which makes this tricky. We don’t know in advance what ID a given entity will get.
To solve this, we use a basic approach: a shared key-value registry table that lives in a central database accessible to all seeders.
Each seeded entity has a unique logical key, represented as a string. For example:
user_1, user_2
When a service seeds an entity, it stores a mapping in the registry table:
- entity type (e.g. user)
- logical key (e.g. user_1)
- actual database ID (e.g. 1435)
Now, when another service, say the payment service, needs to associate a payment with that user, the YAML simply references user_1. The payment seeder looks up the registry table, resolves the actual user ID, and uses it when inserting the payment.
This allows all services to reference entities consistently without ever hardcoding database IDs.
Once seeding is complete, the registry table is truncated. It’s only needed during the seeding process itself.
Registry table schema & lookup flow
CREATE TABLE IF NOT EXISTS id_registry (
    entity_type VARCHAR(255) NOT NULL,
    seed_key    VARCHAR(255) NOT NULL,
    entity_id   BIGINT UNSIGNED,
    PRIMARY KEY (entity_type, seed_key)
)
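The store-and-resolve flow against this table can be sketched as follows. This uses Python's stdlib sqlite3 as a stand-in for the real central database (whose MySQL-style schema is shown above), and the function names are illustrative:

```python
import sqlite3

# In-memory stand-in for the shared registry database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS id_registry (
        entity_type TEXT NOT NULL,
        seed_key    TEXT NOT NULL,
        entity_id   INTEGER,
        PRIMARY KEY (entity_type, seed_key)
    )"""
)

def register(entity_type: str, seed_key: str, entity_id: int) -> None:
    # Called by the owning service right after inserting the entity.
    conn.execute(
        "INSERT INTO id_registry VALUES (?, ?, ?)",
        (entity_type, seed_key, entity_id),
    )

def resolve(entity_type: str, seed_key: str) -> int:
    # Called by downstream seeders to turn a logical key into a real ID.
    row = conn.execute(
        "SELECT entity_id FROM id_registry WHERE entity_type = ? AND seed_key = ?",
        (entity_type, seed_key),
    ).fetchone()
    if row is None:
        raise KeyError(f"{entity_type}/{seed_key} has not been seeded yet")
    return row[0]

# The gateway seeds a user; the payment seeder later resolves its key.
register("user", "user_1", 1435)
user_id = resolve("user", "user_1")
```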
Closing thoughts
This approach isn’t free. It introduces more structure, more discipline, and a tighter coupling between tests and seeders. But for us, that trade-off was worth it.
The result is a seeding system that:
- works for local development
- scales to end-to-end and stress testing
- and, most importantly, cannot fall out of sync without breaking loudly
The key insight for us was that seeder maintenance isn't a discipline problem; it's a design problem. Before this, we had no seeders at all. Every engineer set up their environment manually for each service, and test environments required carefully hand-crafted data that someone had to maintain. Once we made it structurally impossible for seeders to fall behind, we stopped thinking about it.
Today, a new engineer runs a single command and has a fully seeded environment in minutes. That alone made the investment worth it.