One of the most challenging parts of a service-oriented or microservice architecture is connecting data across services.
There are various ways to design such an architecture, but without diving too deeply into theory, there are two very common patterns: a single database shared by all services, or a database per service, where each service owns its own schema.
The latter brings a lot of complexity, especially when you want to create end-to-end tests, spin up local environments, or create isolated test environments. The reason is that you usually have one service that acts as the entry point. For example, it might create a user, an account, or some basic entity, and then all other services rely on the entity IDs coming from that entry service.
Let’s say your gateway holds user information. Then you have a payment service that stores payments for users. That payment service needs to know the ID of the user. As you can imagine, this creates a challenge when you want to create sample data to seed the system.
This comes up all the time: onboarding a new hire and wanting to offer meaningful data so they can open the application locally and explore it, running end-to-end tests in CI, or generating millions of rows to stress test the system and validate scalability.
Being able to maintain seeders across multiple services quickly becomes a serious problem.
The core challenge: maintainability
Below I describe one approach that proved very helpful for our team. We spent quite some time investigating different solutions that ultimately didn’t fit what we were looking for.
The biggest concern, no matter how good your seeders are, is maintainability.
In an evolving system, new entities are introduced, existing entities change shape, new columns are added, new tables appear, and new relationships are created. If your seeders don’t evolve at the same pace, they silently drift out of date and eventually break in subtle ways.
This becomes even harder when you have many services owned by multiple teams. Without strong enforcement, seeders are often the first thing to fall behind.
So the real question for us was: how do we enforce maintainability of seeders across services?
High-level approach
The approach we chose was to build a seeding mechanism inside each service.
Each service is responsible for ingesting a YAML seed file that is generated externally by a dedicated seeder service. That seeder service generates YAML files based on different strategies.
For example, we have:
- a local development strategy
- an end-to-end testing strategy
- a scalability / stress testing strategy
Any developer can create their own strategy as well.
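To make the strategy idea concrete, here is a minimal Python sketch of what such strategies might look like. All class names are hypothetical (the source doesn't show the seeder service's code), and the real service uses Faker for sample data; the stdlib `random` module stands in here so the sketch is self-contained.

```python
import random

class SeedStrategy:
    """Hypothetical base strategy: builds the context fed to the YAML template."""
    n_teams = 2
    n_users = 5

    def build_context(self) -> dict:
        teams = [{"key": f"team_{i}", "name": f"Team {i}"} for i in range(self.n_teams)]
        users = [
            {
                "key": f"user_{i}",
                "name": f"User {i}",
                "email": f"user{i}@example.net",
                "password": "",
                # correlate each user with a team via its logical key
                "team": random.choice(teams)["key"],
            }
            for i in range(self.n_users)
        ]
        return {"teams": teams, "users": users}

class LocalDevStrategy(SeedStrategy):
    pass  # small, human-browsable dataset for local development

class ScalabilityStrategy(SeedStrategy):
    n_teams = 1_000
    n_users = 1_000_000  # stress-test volume

context = LocalDevStrategy().build_context()
```

The strategies differ only in volume here; in practice they could also vary data shapes, edge cases, or which entities get generated at all.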
Architecture overview
At a high level, a seeding strategy generates a YAML file based on a template. That YAML describes:
- entities
- sample data (often generated using Faker)
- correlations between entities across services
Seeder YAML template structure (Jinja2)
The seeder service uses Jinja2 templates to generate YAML files. Here’s an example template:
teams:
{% for team in teams %}
  - key: {{ team.key }}
    name: {{ team.name }}
{% endfor %}
users:
{% for user in users %}
  - key: {{ user.key }}
    name: {{ user.name }}
    email: {{ user.email }}
    password: {{ user.password }}
    team: {{ user.team }}
{% endfor %}
After we run a strategy, the resulting YAML seed file would look like this:
teams:
  - key: team_0
    name: Rodriguez, Figueroa and Sanchez
  - key: team_1
    name: Doyle Ltd
users:
  - key: user_0
    name: Monica Herrera
    email: smiller@example.net
    password: ''
    team: team_0
  - key: user_1
    name: Michele Williams
    email: kendragalloway@example.org
    password: ''
    team: team_0
Seeder service and modules
Each service is a multi-module Gradle project.
One of the key modules is the seeder module. This module parses the generated YAML and uses fixtures from the main service modules to insert data.
This part is very important.
Instead of duplicating insertion logic, the seeder module reuses the same repository functions and fixtures that are already used in tests.
This tight coupling is intentional.
Enforcing correctness through tests
This design gives us a very strong guarantee: seeders cannot silently go out of date.
For example, imagine we add a new column to the payments table, say a product_id that represents the product associated with a payment. To write tests for that change, we must update the fixtures that insert payments into the database.
As soon as we do that, the seeder will break if it hasn’t been updated to provide the new field. The build fails immediately.
That failure forces us to update:
- the seeding logic inside the service
- the YAML generation strategies that provide the data
By reusing repository functions from tests inside seeders, we created a form of soft enforcement. Any schema or entity change automatically forces us to update the seeders.
Shared fixture example
Below is an example of a repository function used in both tests and seeders. If a new column is added to the USERS table, this function must be updated, which immediately breaks any seeder that doesn't provide the new field.
fun createUser(
    teamId: Long,
    name: String = fkr.name.name(),
    email: String = fkr.internet.email(),
    password: String = fkr.random.randomString(12),
): UserEntity {
    val userId =
        dsl.insertInto(USERS)
            .set(USERS.TEAM_ID, teamId) // associate the user with their team
            .set(USERS.NAME, name)
            .set(USERS.EMAIL, email)
            .set(USERS.PASSWORD, password)
            .returningResult(USERS.ID)
            .fetchSingle(USERS.ID) ?: error("Failed to create user")
    return UserEntity(id = userId, name = name, email = email)
}
Running seeders locally
Locally, the setup is fairly simple.
When spinning up the Docker environment, a bash script calls the seeding executable of each service in the correct order and points it to the generated YAML files. These files live in the working directory of the seeding project.
The ordering here matters; services that create foundational entities (e.g. users, teams) must be seeded before services that reference them (e.g. payments). The bash script encodes this dependency order explicitly.
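The dependency ordering can be sketched in a few lines. The real script is bash with the order hand-encoded; this Python stand-in (service names and paths are illustrative) shows the same idea using the stdlib topological sorter, which would also catch accidental cycles:

```python
from graphlib import TopologicalSorter

# Hypothetical services and their seeding dependencies:
# "payments depends on gateway" means the gateway's users/teams
# must exist before payments referencing them are seeded.
deps = {
    "gateway": set(),
    "payments": {"gateway"},
    "notifications": {"gateway", "payments"},
}

# Resolve a valid seeding order, then build the per-service commands.
seed_order = list(TopologicalSorter(deps).static_order())
commands = [f"./{svc}/seed --file seeds/{svc}.yaml" for svc in seed_order]
```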
This gives developers a fully seeded local environment that mirrors real data flows across services.
Running seeders at scale
Things get more interesting when we want to do scalability testing.
For example, deploying a brand-new Kubernetes cluster and seeding it with large datasets across all services. In this case, we use init containers.
Each service runs its main container as usual, and alongside it we run an init container that:
- fetches the appropriate seed file (often from remote storage)
- executes the seeding process once, based on the chosen strategy (for example, scalability testing)
After that, we end up with a fully provisioned environment that is ready to run tests at scale.
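An init container of this kind might look roughly like the following pod-spec fragment. All names, images, and the fetch command are illustrative; the source doesn't show its actual manifests:

```yaml
# Hypothetical pod spec fragment for the payments service.
spec:
  initContainers:
    - name: seed-payments
      image: registry.example.com/payments-seeder:latest
      command: ["/bin/sh", "-c"]
      args:
        - |
          # fetch the seed file from remote storage, then run the seeder once
          fetch-seed --strategy scalability --out /seeds/payments.yaml
          /app/seeder --file /seeds/payments.yaml
      volumeMounts:
        - name: seeds
          mountPath: /seeds
  containers:
    - name: payments
      image: registry.example.com/payments:latest
```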
Connecting entities across services
One remaining question is: how do we connect entities across services?
In many databases, IDs are auto-incremented, which makes this tricky. We don’t know in advance what ID a given entity will get.
To solve this, we use a basic approach: a shared key-value registry table that lives in a central database accessible to all seeders.
Each seeded entity has a unique logical key, represented as a string. For example:
user_1, user_2
When a service seeds an entity, it stores a mapping in the registry table:
- entity type (e.g. user)
- logical key (e.g. user_1)
- actual database ID (e.g. 1435)
Now, when another service, say the payment service, needs to associate a payment with that user, the YAML simply references user_1. The payment seeder looks up the registry table, resolves the actual user ID, and uses it when inserting the payment.
This allows all services to reference entities consistently without ever hardcoding database IDs.
Once seeding is complete, the registry table is truncated. It’s only needed during the seeding process itself.
Registry table schema & lookup flow
CREATE TABLE IF NOT EXISTS id_registry (
    entity_type VARCHAR(255) NOT NULL,
    seed_key    VARCHAR(255) NOT NULL,
    entity_id   BIGINT UNSIGNED,
    PRIMARY KEY (entity_type, seed_key)
)
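The store-and-resolve flow against this table can be sketched as follows. This uses Python's stdlib sqlite3 as a stand-in for the real central database (whose MySQL-style schema is shown above), and the function names are illustrative:

```python
import sqlite3

# In-memory stand-in for the shared registry database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS id_registry (
        entity_type TEXT NOT NULL,
        seed_key    TEXT NOT NULL,
        entity_id   INTEGER,
        PRIMARY KEY (entity_type, seed_key)
    )"""
)

def register(entity_type: str, seed_key: str, entity_id: int) -> None:
    # Called by the owning service right after inserting the entity.
    conn.execute(
        "INSERT INTO id_registry VALUES (?, ?, ?)",
        (entity_type, seed_key, entity_id),
    )

def resolve(entity_type: str, seed_key: str) -> int:
    # Called by downstream seeders to turn a logical key into a real ID.
    row = conn.execute(
        "SELECT entity_id FROM id_registry WHERE entity_type = ? AND seed_key = ?",
        (entity_type, seed_key),
    ).fetchone()
    if row is None:
        raise KeyError(f"{entity_type}/{seed_key} has not been seeded yet")
    return row[0]

# The gateway seeds a user; the payment seeder later resolves its key.
register("user", "user_1", 1435)
user_id = resolve("user", "user_1")
```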
Closing thoughts
This approach isn’t free. It introduces more structure, more discipline, and a tighter coupling between tests and seeders. But for us, that trade-off was worth it.
The result is a seeding system that:
- works for local development
- scales to end-to-end and stress testing
- and, most importantly, cannot fall out of sync without breaking loudly
The key insight for us was that seeder maintenance isn't a discipline problem; it's a design problem. Before this, we had no seeders at all. Every engineer set up their environment manually for each service, and test environments required carefully hand-crafted data that someone had to maintain. Once we made it structurally impossible for seeders to fall behind, we stopped thinking about it.
Today, a new engineer runs a single command and has a fully seeded environment in minutes. That alone made the investment worth it.