Evolutionary Data Making

Chris Gresla 2026-03-17

TLDR

We needed a better way to acquire training data for the embedding model that powers search on our phone OS. Static data generation methods rely on heuristics to pair queries with relevant documents, capturing obvious associations but failing to scale or find nuanced data. Inspired by the principles of evolution, we built a search system where frontier LLMs explore, grade, and refine data generation policies, guided by a constitution of quality principles we call The Good Data Manifesto. A 0.6B parameter model trained on this data improved NDCG@10 by 37% and won or tied 82% of blind head-to-head comparisons on real user queries.

[Interactive figure: Data Generation Policies, comparing the Static Policy with the Evolutionary Policy]

The Distribution Matching Problem

One can reduce all of inductive machine learning into two halves: a dataset and an algorithm for learning the associations of that dataset. In the sub-field of text retrieval, learning algorithms are quite mature. The canonical recipe¹,² is to train dense transformer models that project queries and documents into a shared embedding space, where useful associations between data points are captured through geometric proximity under a model’s representation. Methods like Multiple Negatives Ranking loss³ and advances in hard negative mining⁴ and multi-task training have remained stable and effective choices for years. The algorithmic half of the problem is, in a sense, solved.

The dataset half is not. Among top embedding model providers, the landscape is strikingly opaque. Qwen3 Embedding⁵ outlines its synthetic data pipeline at a high level but omits key details and does not release the training dataset. Cohere, Voyage AI, OpenAI, and Mistral have not published research papers or released data describing their embedding training methodology. Jina⁶ releases model weights but not the curated fine-tuning data. A few groups have bucked this trend — Nomic⁷ released the full 235M training pairs for Nomic Embed, and BAAI published the data behind the BGE family⁸ — but even these releases contain the data itself rather than the process that created it. The methodology for generating high-quality retrieval training data remains largely unshared. We believe the data generation process is the more consequential half of the problem, and that sharing how it works is as important as sharing the data itself.

At Wafer, we are building a mobile operating system that understands you as well as you do. Part of our vision for what computing should be involves making all of your data transparently accessible. Users should be able to surf over their personal data as they enjoy surfing the web today. Paramount to our search system is the embedding model described in this post: the retrieval layer that connects natural-language queries to information scattered across a user’s digital life — emails from Gmail, group chats in WhatsApp, late-night transactions in Venmo, planning sessions in Slack, travel itineraries from your Calendar, research notes in Notion, and those esoteric songs you found on YouTube at 2am.⁹ As an example, one of our founding team members’ phone indexes data from 57 distinct applications, totaling ~194,000 individual sources drawn over the course of a few months of usage.

Our OS enables users to access their data in — unfortunately, and beautifully — novel ways. Consider a brief trip to Denver you made with your family. You want to know: “How much did I spend on my Denver trip?” To answer this yourself, you would have to go through a mental accounting exercise — thinking through things like “okay what all did I do, where did we go, what did we eat?” — and then opening each disparate application to piece together a picture of what you spent, all the while tracking everything manually. But there are data points scattered across the applications on your phone that we could use to get this information autonomously: a cash withdrawal notice from your bank, an Airbnb confirmation email, Venmo paybacks from friends, a Google Maps timeline of places visited, calendar events for flights and activities. From these disjoint pieces of information, our phone can assemble a meaningful and comprehensive answer for the user.

Unfortunately, these varied data points are not semantically similar in the classical sense, nor are they necessarily syntactically similar. Useful textual information relates to queries through a nearly infinite set of characteristics: relevance over time, causality, shared themes, common entities, project membership, transactional context, and more. We frame this challenge as a distribution matching problem¹⁰: the distribution of queries and the distribution of useful information and answers occupy different regions in the space of all text, connected by latent structures that are rich, varied, and difficult to fully enumerate in a vacuum.

The combinatorial space of possible queries to relevant sources is enormous. In this space, the most interesting associations are sparse, and they are rarely syntactically colocated in a way that simple heuristics can capture. We can think of the data that our models could learn from as living in a high-dimensional manifold¹¹,¹², which we refer to as The Space of All Data.

The Space of All Data

Many dimensions in this space are not useful to explore — sources from Gmail are related to sources from Outlook because both are “just emails,” Google Docs relates to Google Sheets because both live in the Google office suite, a song played at 3pm is followed by another at 3:01pm but this tells us little about the user’s intent unless you know how listening patterns reflect mood or preference. These associations, whilst valid, are not especially useful for training the kind of retrieval model that could search your phone better than you can.

However, many dimensions in the space of valid data are useful, a subset of which are:

Temporal: given a query issued at time t, which sources from the surrounding days and weeks provide relevant context?
Thematic: which sources relate to the same project, relationship, or life event?
Associative: which sources are linked by context or state that would not be captured through keyword overlap?
Cross-application: which sources from entirely different apps jointly answer a single question?
Pragmatic: which sources address what a user would actually ask about, in the way they would actually phrase it?
Negative: which sources correctly establish that something has not happened or does not exist yet? Absence of evidence does not necessarily imply evidence of absence.
Relational: which sources connect a named entity to their role in the user’s life (colleague, family, organization) across disparate contexts?

A straightforward approach to generating training data from such a space is to leverage the capability of frontier LLMs to verify the associations between data: take candidate text pairs, filter them via BM25 or existing embeddings, then prompt a frontier LLM to judge relevance¹³,¹⁴,¹⁵. This recipe produces labeled retrieval datasets at scale. The baseline method in this work used a similar pattern: sampling data via heuristics such as related pairs of applications, similar timestamps, and word overlap, then filtering with majority-vote LLM grading.
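To make the majority-vote filtering step concrete, here is a minimal sketch. The grader callables are hypothetical stand-ins for independent LLM relevance judgments; the post does not say how many votes the baseline used or how ties were broken, so the strict-majority rule below is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]                     # (query, candidate document)
Grader = Callable[[str, str], bool]        # hypothetical stand-in for one LLM judgment


def majority_vote(votes: Sequence[bool]) -> bool:
    """True only when strictly more than half of the votes are positive."""
    return sum(votes) * 2 > len(votes)


def filter_by_majority(pairs: List[Pair], graders: Sequence[Grader]) -> List[Pair]:
    """Keep the pairs that a strict majority of graders judge relevant."""
    return [(q, d) for q, d in pairs if majority_vote([g(q, d) for g in graders])]


# Toy graders (assumptions for illustration only, not real LLM calls).
def shares_a_word(q: str, d: str) -> bool:
    return bool(set(q.lower().split()) & set(d.lower().split()))

def always_yes(q: str, d: str) -> bool:
    return True

def always_no(q: str, d: str) -> bool:
    return False


candidates = [("airbnb denver", "Your Airbnb confirmation"), ("gym", "Venmo receipt")]
kept = filter_by_majority(candidates, [shares_a_word, always_yes, always_no])
```

With these toy graders only the first pair survives, because the second pair gets a single positive vote out of three.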

This method works — but it saturates. When sampling from this space with a fixed policy, one quickly exhausts all of the useful associations within the veins of data that the fixed policy covers, rendering them dry. Furthermore, such a fixed method has no chance at finding data which makes connections beyond the heuristics entirely — connecting a Slack thread about a project to a Notion doc and a Calendar invite from the same week, or linking a gym check-in notification to a health app log and a friend’s text message about meeting up after.

Establishing fixed policies for sampling data does not scale. When working with LLMs, it requires manually reviewing results and then re-running the sampling loop. When working with human annotators, the feedback loops are even looser. These points led us to an alternate framing: constructing a dataset is really a search problem. We search for the good data in an enormous combinatorial space, rather than iterating on various fixed policies in an effort to cover all of the meaningful parts.

The evolutionary search literature has shown that LLMs make remarkably effective operators for open-ended search. Evolution through Large Models¹⁶ demonstrated that LLMs serve as semantically meaningful mutation operators. FunSearch¹⁷ paired LLMs with systematic evaluation to discover new mathematical constructions. AlphaEvolve¹⁸ generalized this to entire codebases. ShinkaEvolve¹⁹ matched AlphaEvolve’s results using orders of magnitude fewer evaluations through meta-scratchpads and novelty rejection. And Auto Evol-Instruct²⁰ evolved the instruction evolution strategy itself — meta-evolution applied to data synthesis.

We also do not want to specify the nitty-gritty for every type of data that could be found; doing so would ultimately limit the kinds of associations that could be discovered. Ideally, we establish general principles that good data should adhere to (factually correct, naturally phrased, grounded in evidence) and then let the models explore the large space of valid data with these principles in mind. The constitutional AI framework²¹ introduced exactly this idea: specifying general principles, a constitution, to guide model behavior, rather than defining quality through individual labels. We adapt this from the alignment literature to data generation.

We realized that what we needed was a dynamic data generation policy — one that evolves in response to what it discovers. The evolutionary search framework, combined with constitutional evaluation, gives us exactly this. Instead of defining a fixed policy to sample from the data space, we use frontier language models as evolutionary operators: an architect model reads an exploration history, decides where to search next, generates candidate training data, and receives constitutional feedback that drives the next iteration. The contribution here is stitching together these adjacent but not explicitly connected ideas — evolutionary search, constitutional evaluation, LLMs as operators — and applying the combination to the problem of retrieval dataset generation, where the search space is a user’s personal data and the fitness landscape is defined by a constitution.²²

Method

To test our hypothesis about synthetic data generation and to develop a better version of our retrieval system here at Wafer, we implement an evolutionary search over the space of possible retrieval training data. The system has a small number of components that we compose into a loop:

Architect: an LLM that reads the current exploration history and the corpus index, and decides what to explore next. In this work, our architect chooses one of three actions with each step of the evolutionary process:
- new_root: start fresh with unexplored sources, opening a new branch of the tree
- deepen: refine a previously validated genome — adjust sources, narrow the strategy, incorporate specific grader feedback from the last attempt
- cross_pollinate: combine elements from two validated genomes from different branches to discover associations neither would find alone
Genome: a frozen sampling context — the DNA that specifies what training data should be generated (which sources, what time window, what query strategy)
Generator: an LLM that reads a genome and produces candidate training samples
Phenotype: an actual training sample — a (query, evidence, answer) triple with cited sources
Grader: evaluates each phenotype against our constitution of eight quality principles
Exploration Tree: the accumulated history of all genomes, their fitness statistics, and per-principle failure counts

The algorithm itself is domain-agnostic: evolve sampling contexts, grade the outputs constitutionally, use the feedback to inform the next iteration. We apply it to a specific problem of relevance to improving our operating system: generating high-quality retrieval training data from the contents of a phone’s data.

A genome in our application is the DNA that defines what training data should be instantiated. It contains a set of 15-30 source groups selected for thematic coherence (same trip, same project, same person, same time period), a specific datetime that determines temporal tense and relevance, a natural-language strategy describing what kinds of queries to generate (e.g., “explore how travel planning discussions connect to booking confirmations and shared expenses among friends”), and user context (name, email, timezone). From this genome, we birth phenotypes — actual training samples that implement the genome’s recipe. The genome does not contain training data; it contains the instructions for producing it.

Phenotypes are the training samples themselves. A generator LLM receives the genome’s sources, datetime, and strategy, then produces 3-5 structured outputs: a query written as though typed on a phone (terse, colloquial), literal substrings extracted as evidence from the source text (not paraphrased), cited unique identifiers for the sources that contributed evidence, and a comprehensive grounded answer. The fitness of a genome is derived from its phenotypes’ adherence to the constitution.
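A minimal sketch of how the genome and phenotype described above could be represented. The field names are assumptions made for illustration, and so is `pass_rate`: the post says fitness is derived from constitutional adherence but does not give the formula, so a simple pass rate stands in here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)          # frozen: a genome is a fixed sampling context
class Genome:
    source_group_ids: Tuple[str, ...]   # 15-30 thematically coherent source groups
    as_of: datetime                     # determines temporal tense and relevance
    strategy: str                       # natural-language query strategy
    user_name: str
    user_email: str
    timezone: str

    def __post_init__(self) -> None:
        if not 15 <= len(self.source_group_ids) <= 30:
            raise ValueError("a genome selects 15-30 source groups")


@dataclass(frozen=True)
class Phenotype:
    query: str                    # terse, phone-typed query
    evidence: Tuple[str, ...]     # literal substrings of the source text
    cited_ids: Tuple[str, ...]    # identifiers of the sources that supplied evidence
    answer: str                   # grounded answer


def pass_rate(verdicts: List[bool]) -> float:
    """Assumed fitness proxy: fraction of a genome's phenotypes that pass grading."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Freezing the dataclass mirrors the idea that the genome holds instructions rather than data: once written it is only read by the generator, and a refinement produces a new genome instead of mutating the old one.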

A Bit of Scaffolding

Before evolution begins, we build a domain-specific scaffold: a Corpus Index that gives the architect an efficient representation of the search space. For our retrieval problem, this contains entity threads (entities appearing in 2+ source groups), temporal bins (monthly source buckets), per-application group statistics, and one-line summaries for each source group. This is the “map” the architect reads to decide where to explore. The index also tracks which groups have been visited and how many times, providing a freshness signal that discourages over-exploitation of already-sampled regions. The corpus index is specific to our problem — a different data generation task might not need one, or might need a different kind of summary. We build it primarily for efficiency: so the architect can reason about what to explore without reading every source document.
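As a rough illustration of the index, the sketch below builds entity threads, monthly bins, per-application counts, summaries, and visit counts from a list of source groups. The input schema (`id`, `app`, `date`, `entities`, `summary`) is hypothetical; the post does not describe how the real index is stored.

```python
from collections import Counter, defaultdict
from datetime import date


def build_corpus_index(groups):
    """Build a compact map of the corpus from source-group records (assumed schema)."""
    entity_to_groups = defaultdict(set)
    temporal_bins = defaultdict(list)
    app_counts = Counter()
    for g in groups:
        for entity in g["entities"]:
            entity_to_groups[entity].add(g["id"])
        temporal_bins[g["date"].strftime("%Y-%m")].append(g["id"])   # monthly bucket
        app_counts[g["app"]] += 1
    return {
        # Entity threads: only entities that appear in 2+ source groups.
        "entity_threads": {e: sorted(ids) for e, ids in entity_to_groups.items()
                           if len(ids) >= 2},
        "temporal_bins": dict(temporal_bins),
        "app_counts": dict(app_counts),
        "summaries": {g["id"]: g["summary"] for g in groups},
        "visits": Counter(),      # freshness signal: times each group was sampled
    }


def mark_visited(index, group_ids):
    """Record that a genome sampled these groups."""
    index["visits"].update(group_ids)


groups = [
    {"id": "g1", "app": "Gmail", "date": date(2025, 7, 28),
     "entities": ["Athar"], "summary": "Taskrabbit cleaning receipt"},
    {"id": "g2", "app": "Slack", "date": date(2025, 7, 30),
     "entities": ["Athar", "Andre"], "summary": "scheduling thread"},
    {"id": "g3", "app": "Slack", "date": date(2025, 5, 17),
     "entities": ["Andre"], "summary": "api subdomain discussion"},
]
index = build_corpus_index(groups)
mark_visited(index, ["g1", "g1", "g3"])
```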

The Good Data Manifesto: Constitutional Grading

We tried several formulations for assessing the fitness of generated data before arriving at our current approach. Early iterations used a single LLM call returning a binary good/bad verdict with a brief justification. This told us whether a sample was bad but not in what specific way, and a vague “bad” signal gives the evolutionary process nothing actionable to work with. We found that fine-grained evaluation criteria were instrumental in making our approach work: knowing precisely how a sample fails produces feedback that the architect can act on in the next iteration.

Inspired by the constitutional AI approach²¹, we define quality through broad principles rather than individual labels, and let the model reason about adherence to each principle independently. The current generation of frontier models is capable enough to do this well. This gives us The Good Data Manifesto, a constitution of eight principles:

#	Principle	Tests
1	Evidence Quality	Is the evidence substantive, direct, and sufficient?
2	Query Naturalness	Does this read like something a human would actually type?
3	Practical Utility	Is this about real commitments, actions, things the user needs?
4	Query-Evidence Coherence	Does the evidence directly address the query?
5	Temporal Consistency	Does tense match source dates relative to the specified datetime?
6	Perspective	Is the query from, and the answer to, the user’s perspective?
7	Answer Accuracy	Is the answer grounded in evidence, with no hallucination?
8	Answer Completeness	Does the answer address the full scope without cherry-picking?

Grading is two-tiered. Deterministic checks (no LLM cost) catch structural failures first: empty fields, evidence not grounded as literal substrings in the source text, invalid identifiers, temporal tense mismatches, self-answering queries. This is cheap and filters out a substantial fraction of bad samples before touching the API. Samples that pass proceed to LLM constitutional grading, where a grader model evaluates each principle independently, returning per-principle verdicts with specific fix instructions.
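The first tier can be sketched as a pure function with no model calls. The sample layout and failure labels below are assumptions; the tense-mismatch and self-answering checks are left out because the post does not describe their rules.

```python
def deterministic_failures(sample, sources):
    """Return structural failures for one sample.

    sample:  dict with 'query', 'evidence' (list of str), 'cited_ids' (list of str),
             and 'answer' (assumed layout).
    sources: dict mapping source identifier to its raw text.
    """
    failures = []

    # Empty fields.
    for field in ("query", "evidence", "cited_ids", "answer"):
        if not sample.get(field):
            failures.append(f"empty_field:{field}")

    # Cited identifiers must exist in the corpus.
    cited_ids = sample.get("cited_ids") or []
    if any(i not in sources for i in cited_ids):
        failures.append("invalid_identifier")

    # Evidence must be a literal substring of a cited source, not a paraphrase.
    cited_texts = [sources[i] for i in cited_ids if i in sources]
    for span in sample.get("evidence") or []:
        if not any(span in text for text in cited_texts):
            failures.append("evidence_not_literal_substring")
            break

    return failures


sources = {"s1": "You paid Vishal Vaddadhi $55.00"}
good = {"query": "did I pay Vishal?", "evidence": ["You paid Vishal Vaddadhi $55.00"],
        "cited_ids": ["s1"], "answer": "Yes, you paid Vishal $55.00."}
paraphrased = dict(good, evidence=["You sent Vishal 55 dollars"])
```

Only samples for which this function returns an empty list would move on to the LLM grader.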

Bad samples are not just bad in the abstract — they are wrong for specific, identifiable reasons. A sample with a future event described in past tense may violate the Temporal Consistency principle. A sample that reads like a search engine keyword query rather than something a person would type on their phone may violate Query Naturalness. Those specific failure reasons feed back into the architect’s reasoning, telling it what to change on the next iteration. This is what makes the loop genuinely evolutionary rather than merely filtering: the feedback drives the search toward greater novelty and quality, using past failures as examples of what not to repeat.
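One way to turn per-principle verdicts into something the architect can read is to aggregate them per genome. This is a sketch under assumptions: the verdict layout (`passed`, `fix`) is hypothetical, and the real grader output format is not shown in the post.

```python
from collections import Counter


def failure_counts(graded_samples):
    """Count, per principle, how many samples violated it."""
    counts = Counter()
    for verdicts in graded_samples:
        for principle, verdict in verdicts.items():
            if not verdict["passed"]:
                counts[principle] += 1
    return counts


def feedback_lines(graded_samples):
    """Collect the grader's fix instructions as short lines for the architect prompt."""
    return [f"{principle}: {verdict['fix']}"
            for verdicts in graded_samples
            for principle, verdict in verdicts.items()
            if not verdict["passed"]]


graded = [
    {"Temporal Consistency": {"passed": False, "fix": "use future tense for upcoming events"},
     "Query Naturalness": {"passed": True, "fix": ""}},
    {"Temporal Consistency": {"passed": True, "fix": ""},
     "Query Naturalness": {"passed": False, "fix": "phrase it as a person would type it"}},
    {"Temporal Consistency": {"passed": False, "fix": "match tense to the genome datetime"},
     "Query Naturalness": {"passed": True, "fix": ""}},
]
counts = failure_counts(graded)
lines = feedback_lines(graded)
```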

The Discovery Loop

The core process runs as tree-based exploration. At each step, the architect reads the current exploration tree — previous genomes, their success rates, per-principle failure distributions — alongside the corpus index and coverage statistics, then selects one of the three actions described above. The generator produces 3-5 candidate samples from the resulting genome and the grader evaluates each sample against our constitution. Valid samples join the dataset; rejected samples contribute diagnostic feedback that shapes the next iteration. The genome is added to the tree with its fitness statistics.
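The loop can be sketched as a skeleton in which the three LLM roles are passed in as plain callables. Everything here is illustrative: the `Node` fields, the callable signatures, and the toy stand-ins are assumptions, and the corpus index and coverage statistics that the real architect also reads are omitted for brevity.

```python
from dataclasses import dataclass, field

ACTIONS = ("new_root", "deepen", "cross_pollinate")


@dataclass
class Node:
    action: str
    genome: dict
    parents: tuple                 # indices of parent nodes; empty for new_root
    accepted: int = 0
    failures: dict = field(default_factory=dict)   # principle -> failure count


def discovery_loop(architect, generator, grader, steps):
    """Run `steps` iterations; return the exploration tree and the accepted samples."""
    tree, dataset = [], []
    for _ in range(steps):
        action, genome, parents = architect(tree)       # reads the history, picks a move
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        node = Node(action, genome, tuple(parents))
        for sample in generator(genome):                 # candidate phenotypes
            violated = grader(sample)                    # list of violated principles
            if violated:
                for principle in violated:
                    node.failures[principle] = node.failures.get(principle, 0) + 1
            else:
                node.accepted += 1
                dataset.append(sample)
        tree.append(node)                                # genome joins the tree with its stats
    return tree, dataset


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_architect(tree):
    if not tree:
        return "new_root", {"strategy": "trip expenses"}, ()
    return "deepen", dict(tree[-1].genome), (len(tree) - 1,)

def toy_generator(genome):
    return [f"{genome['strategy']} q{i}" for i in range(3)]

def toy_grader(sample):
    return ["Query Naturalness"] if sample.endswith("q0") else []


tree, dataset = discovery_loop(toy_architect, toy_generator, toy_grader, steps=2)
```

Keeping the failure counts on each node is what lets a later architect call see which principles failed in which region of the tree.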

[Interactive figure: Evolutionary Discovery Loop. Step 1 of 10, Exploration Tree: 20 nodes explored across 4 branches. The architect surveys the tree state and corpus index.]

Our tree structure distinguishes this approach from a flat generation-based evolution method. It enables depth-first exploitation of promising veins (deepening a genome that found good cross-app queries about a trip) while maintaining breadth via parallel branches exploring completely different regions of the data space. The architect reads the tree holistically, balancing exploration and exploitation through natural language reasoning.

To clarify what we evolve: the target of the evolutionary process is the genome, which is a policy for sampling data, not the data itself. Generated data forms the basis of fitness for its genome, but genomes are not meant to live long. Much like in ShinkaEvolve [Lange et al., 2025], genomes serve as stepping stones [Stanley & Lehman, 2015] toward greater novelty and coverage. The accumulated knowledge of the evolutionary process lives in the tree structure: the history of which regions were fertile, which principles failed where, and which cross-pollinations yielded surprising results. This tree can be resumed at any point: after updating the constitution, after adding new data to the corpus, or to upsample specific veins that proved valuable. The transitions between branches and across them form the history of the search, and the connections between nodes are where the knowledge is encoded.

Experiments

Our experimental setup is straightforward. We sample data with both types of policies (a static, heuristic-driven method and the evolutionary method) operating on the same world state (same phone database, same user, same corpus). We train models with the exact same training algorithm: Qwen3-Embedding-0.6B [Yang et al., 2025] fine-tuned with cached MNR loss, down to the same seed, all hyperparameters identical. The only variable is the dataset. Notably, the evolutionary datasets contain far fewer training pairs than the static baseline, yet the models trained on them learn richer representations. Our evaluations are conducted with data drawn from phone contexts that differ from our training distribution, ensuring these are genuine test sets.

Configuration	Training Data	Unique Queries	Training Pairs	Source
mnr-v2.1 (baseline)	Static policy	~5,400	~82,000	Heuristic sampling + majority-vote LLM grading
evo-3.0	Evolutionary	6,443	~25,000	Tree-based discovery, 1,020 steps
evo-3.1	Evolutionary	8,008	~38,000	Improved constitution + broader query diversity

Results

We evaluate on 57 queries where at least one model achieves a score above zero. For reference, we include the unfinetuned Qwen3-Embedding-0.6B base model alongside the static baseline and our best evolutionary model. The evolutionary model (evo-3.1) was trained with independently tuned hyperparameters (lr=8e-5, linear schedule, 100 steps). The evo-3.1 dataset contains ~38,000 training pairs versus the static baseline’s ~82,000.

[Figure: mnr-v2.1 (static baseline) vs evo-3.1 (evolutionary). NDCG@10: 0.358 vs 0.492. MRR: 0.363 vs 0.530. Recall: 0.529 vs 0.661.]

Overall results on the 57 queries:

Model	NDCG@10 (57 queries)	MRR	Recall	NDCG@10 delta vs baseline
evo-3.1	0.4916	0.530	0.661	+0.1333
j9b-synth3s-ckpt50	0.4879	0.526	0.633	+0.1296
evo-3.0	0.4527	0.456	0.634	+0.0944
mnr-v2.1 (baseline)	0.3583	0.364	0.529	—
Qwen3-base (no finetune)	0.3722	0.385	0.610	+0.0139

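For evo-3.1, this is a +37% relative improvement in NDCG@10 over the static baseline (0.4916 vs 0.3583). Per-category breakdown, stratified by the type of information a model needs to represent correctly (a query can fall into more than one category):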

Category (# queries)	evo-3.1 (best)	mnr-v2.1	Delta	Notes
Entity (22)	0.403	0.338	+0.066	Named people, organizations, brands
Discovery (13)	0.302	0.167	+0.135	Open-ended “what do I know about X”
Temporal (19)	0.273	0.254	+0.020	Date-dependent queries
Specificity (24)	0.267	0.176	+0.091	Exact facts, order numbers, confirmation codes
Signal/Noise (14)	0.166	0.083	+0.083	Extracting from noisy marketing/promo content
Cross-App (10)	0.129	0.063	+0.067	Info spanning multiple applications

The evolutionary model improves in every category. The largest absolute gain is in Discovery (+0.135) — precisely the open-ended queries that static policies are structurally unable to produce training data for.
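For readers who want to reproduce the headline arithmetic, here is a small sketch of NDCG@10 and of the relative improvement between the two reported scores. The linear-gain DCG below is one common variant and an assumption here; the post does not say which gain function or relevance labels its evaluation used.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance of each returned result, in rank order.
    all_gains:    relevance of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_improvement(new, baseline):
    return (new - baseline) / baseline


# The single relevant document is returned at rank 2.
score = ndcg_at_k([0, 1, 0], [1])

# Reported NDCG@10: evo-3.1 = 0.4916, mnr-v2.1 = 0.3583.
gain = relative_improvement(0.4916, 0.3583)
print(f"{gain:.1%}")  # 37.2%
```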

We conducted two separate blind head-to-head comparisons with independent judges.

A/B Test: 50 Real User Queries

For each of 50 real user queries against a 191k source-group production database, both models retrieved top-5 results. The assignment of “Model A” and “Model B” was randomized per query to prevent positional bias. Three independent judges evaluated each pair blind, with the final winner determined by majority vote:

Category	evo-3.1 wins	v2.1 wins	Ties
Travel	4	1	0
People	2	2	1
Food	2	1	2
Finance	1	1	3
Health	4	1	0
Work/Tech	5	0	0
Shopping	2	1	2
Home/Utilities	1	1	3
Events	3	0	2
Misc	3	1	1
Total	28	9	13
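The blind protocol can be sketched in a few lines: shuffle which model appears as "A" per query, collect three votes, and resolve them by majority. How the actual study handled a three-way split among judges is not stated, so treating it as a tie is an assumption, as are the function names.

```python
import random
from collections import Counter


def blind_pair(evo_results, base_results, rng):
    """Randomly assign the two result lists to labels A and B for one query."""
    if rng.random() < 0.5:
        return {"A": evo_results, "B": base_results}, {"A": "evo-3.1", "B": "v2.1"}
    return {"A": base_results, "B": evo_results}, {"A": "v2.1", "B": "evo-3.1"}


def majority_verdict(votes):
    """votes: list of 'A', 'B', or 'tie'. A label wins only with a strict majority."""
    label, n = Counter(votes).most_common(1)[0]
    return label if n * 2 > len(votes) else "tie"


def unblind(verdict, key):
    """Map a blind verdict back to the model name."""
    return key.get(verdict, "tie")


rng = random.Random(0)
shown, key = blind_pair(["slack thread"], ["promo email"], rng)
winner = unblind(majority_verdict(["A", "A", "B"]), key)

# Win-or-tie rate from the totals row: 28 wins and 13 ties out of 50 queries.
win_or_tie = (28 + 13) / 50
```

The totals row gives a win-or-tie rate of 0.82, which matches the 82% quoted in the TLDR.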

evo-3.1’s main advantage is breadth of relevant context and better semantic generalization. It surfaces meaningful conversational context, avoids thematic drift into irrelevant promotional material, and handles indirect queries better. The most striking category gap is Work/Tech (5-0): evo-3.1 consistently retrieved relevant Slack context while v2.1 returned generic or irrelevant results for the same queries. On a query about medication side effects, v2.1 returned 5 weight-loss advertisements; evo-3.1 retrieved the actual medical context.

v2.1’s advantage is cleaner top-1 precision on highly specific numerical queries: exact bill amounts, specific class cancellations by date. When the exact document is in the index and the query string closely matches, v2.1’s higher similarity scores surface it at rank 1 reliably.

Novel Head-to-Head: 50 Open-Ended Queries

We ran a second comparison using 50 novel queries (no overlap with the structured eval) as a three-way blind test between v2.1, evo-3.0, and evo-3.1. Results were anonymized (Model A/B/C) and sent to 3 independent judges:

Model	Judge 1	Judge 2	Judge 3	Average
v2.1	13	15	13	13.7
evo-3.0	13	9	14	12.0
evo-3.1	13	13	15	13.7

On novel, open-ended queries without ground truth, all three models perform roughly equivalently on aggregate — each with distinct strengths. v2.1 wins on travel diversity and branded entity recall. evo-3.0 wins on restaurant/food discovery and infrastructure queries. evo-3.1 wins on food delivery receipts with dollar amounts, health/medical content, tax documents, and Slack messages (the only model to find actual Slack notification content).

evo-3.1 excels on the query types most representative of actual usage: transactional retrieval (“how much did I spend on X”), work context (Slack, API docs, cloud pricing), and health and admin (medical appointments, insurance, tax docs). Combined with the +37% NDCG improvement on the structured eval, evo-3.1 is the clear choice for production.

Data Characteristics

The evolutionary process produces measurably different data:

Characteristic	evo-3.0	evo-3.1	mnr-v2.1 (static)
Genomes explored	3,060	6,153	N/A (fixed policy)
Valid samples	7,292	8,461	~5,400
Multi-source queries (2+)	20.1%	52.2%	~5%
Cross-app queries	n/a	22.2%	~0%
Discovery/recall queries	~0	826	~0
Cost per unique query	$0.056	$0.180	Not tracked

Multi-source queries jumped from ~5% in the static method to 52.2% in evo-3.1. These require synthesizing evidence from multiple source groups simultaneously — exactly the kind of training signal that static pairing policies cannot generate.

Conclusion

The core insight is that dataset creation for retrieval is better framed as a search problem than a labeling problem. The combinatorial space of possible query-evidence-answer triples is vast, and the interesting associations — the ones that teach an embedding model something useful — are precisely those that static policies cannot anticipate. Evolutionary search, guided by constitutional fitness and an adaptive architect, systematically discovers these associations.

The method produces training data that is qualitatively different from statically generated data — more natural queries, tighter query-to-source coupling, richer cross-application associations — and quantitatively superior: a +37% relative improvement in NDCG@10 and a 28-9 win margin in blind evaluation.

Perhaps the most important observation: the barrier to generating human-level data annotations appears to have been crossed by the current generation of frontier models. What remains is using them in efficient ways. The evolutionary framework provides one such way — turning the broad intelligence of frontier models into structured, constitutionally grounded, fitness-directed data generation that produces training signal no static policy could match.

Future Work

Several directions remain open:

Diversity and Niche Exploitation

The evolutionary process, like all search processes, is susceptible to niche exploitation. In our runs, topics like UI-TARS model evaluations grew from 4 queries in generation 0 to 26 by generation 9. Incorporating dataset-level diversity objectives — akin to MAP-Elites or novelty search [Stanley & Lehman, 2015] — could maintain breadth while the architect deepens promising veins.

Training Larger Models

All experiments used Qwen3-Embedding-0.6B, which showed signs of a compute ceiling — all checkpoints beyond a certain point plateaued regardless of data quality. This is a limitation of both the model capacity and our deployment environment (on-device inference on a phone). Training larger models on the same evolutionary data would test whether the data quality gains scale with model size, and whether the representational bottleneck shifts from data to architecture.

Constitution Completeness

Our eight principles cover structural quality (grounding, coherence, accuracy) and pragmatic quality (naturalness, utility, perspective), but the space of desirable retrieval data properties is larger. Can we define a comprehensive constitution that covers effectively all attributes of good data? Or does the constitution itself need to evolve?

Cost Reduction

At ~$0.18 per unique query for evo-3.1, the method is tractable for focused domains but expensive at web scale. The architect model dominates costs (63% of per-step expenditure). Smaller models for generation/grading with frontier models reserved for the architect, or local model serving, are active directions.

Entity Relationship Resolution

The current system operates on literal entity names. It cannot infer that “Madhu” is the user’s mother. Incorporating entity relationship resolution would unlock a richer class of training data (“when is my mom’s art fair thing?”).

Appendix

Implementation Details

Deduplication

Two levels of deduplication prevent the dataset from collapsing. Run-level dedup uses a signature of (query text, frozenset of cited identifiers) to reject identical query-source combinations. Cross-run dedup uses Jaccard similarity on query tokens (threshold 0.65) to catch near-duplicate phrasings.
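Both levels are simple to sketch. The signature and the 0.65 threshold come from the description above; the tokenization (lowercased whitespace split) and the choice of `>=` at the threshold are assumptions.

```python
def run_signature(query, cited_ids):
    """Exact-match key: same query text citing the same set of sources."""
    return (query, frozenset(cited_ids))


def jaccard(a, b):
    """Jaccard similarity of the two queries' token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(query, seen_queries, threshold=0.65):
    """True when the query is too close to any previously accepted query."""
    return any(jaccard(query, seen) >= threshold for seen in seen_queries)


seen_signatures = {run_signature("did i pay vishal", ["s2", "s1"])}
same_combo = run_signature("did i pay vishal", ["s1", "s2"]) in seen_signatures

seen_queries = ["did i pay vishal"]
near = is_near_duplicate("did i pay vishal yet", seen_queries)       # 4/5 = 0.8
far = is_near_duplicate("any b200s on demand?", seen_queries)        # no shared tokens
```

Using a `frozenset` makes the run-level signature independent of citation order, so the same query citing the same sources in a different order is still rejected.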

Label Collision Bug

An important note on experimental integrity: during our training study, we discovered that the combined_mnr() dataset creation function had a label collision bug (Path(path_str).parent.name) that silently dropped evo-3.1 data from combined datasets. All “combined” models in jobs 7 and 8a-8d were actually trained on evo-3.0 data only. This was fixed, and the corrected j9 training jobs produced the final champion model.
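To illustrate how a label derived from `Path(path_str).parent.name` can collide, consider the sketch below. The directory layout is hypothetical, and the merge logic inside `combined_mnr()` is not shown in this post, so this only demonstrates the general failure mode, not the exact bug.

```python
from pathlib import Path

# Hypothetical layout: both datasets keep their file in a folder named "train".
paths = [
    "runs/evo-3.0/train/data.jsonl",
    "runs/evo-3.1/train/data.jsonl",
]


def parent_label(path_str):
    """Label derived from the immediate parent directory only."""
    return Path(path_str).parent.name


def unique_label(path_str):
    """One possible fix: include the grandparent directory in the label."""
    return "/".join(Path(path_str).parts[-3:-1])


# Keyed by parent_label, the two datasets share one key, so one entry is lost silently.
by_parent = {parent_label(p): p for p in paths}
by_unique = {unique_label(p): p for p in paths}
```

A cheap guard against this class of bug is to assert that the number of labels equals the number of input paths before combining.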

Problem Formulation

Given a personal data corpus C consisting of N source groups across K applications, we seek to produce a training dataset D = {(q_i, e_i, a_i)} of query-evidence-answer triples that, when used to train an embedding model via MNR loss, produces useful representations for personal search. The search space is combinatorial: ~28,000 source groups × arbitrary temporal contexts × unbounded query phrasings.
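For reference, the MNR objective treats every other positive in the batch as a negative for a given query. The sketch below is a plain-Python version over precomputed embedding vectors; the scale of 20 and the use of cosine similarity are assumptions, and how the triples in D are turned into (query, positive) pairs is not specified here. The actual training used a cached variant of this loss.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(queries, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for queries[i]
    and every other row of `positives` acts as an in-batch negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in positives]
        m = max(logits)                                        # for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])                       # -log softmax at the target
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(queries, [[1.0, 0.0], [0.0, 1.0]])   # each query matches its positive
swapped = mnr_loss(queries, [[0.0, 1.0], [1.0, 0.0]])   # each query matches the wrong one
```

The loss is near zero when each query is closest to its own positive and grows toward the scale value when the pairing is inverted.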

Sample Showcase

Examples drawn from the production evolutionary run:²³

Terse, phone-typed query (4 words)

Query: “Any B200s on demand?” Evidence: “8x NVIDIA HGX B200 instances | available on-demand | Lambda Inc.” Answer: “Yes, you have on-demand access to 8x NVIDIA HGX B200 instances via Lambda Inc…” Apps: Gmail | Tree depth: 6

Four words, colloquial, requires understanding “B200s” = NVIDIA GPUs. Tree depth 6 — the evolutionary process refined the strategy six times to produce it.

Cross-application synthesis (3 apps)

Query: “What’s the latest on K2 model training and the new architecture Vik shared?” Evidence: “Kimi K2 | large language model series developed by Moonshot AI team | Define a set of tool calls and associated schemas” Answer: “According to your Slack messages from late July, the plan for the Kimi K2 model…” Apps: Slack + Twitter + Notion | Tree depth: 2

Three completely different applications contribute to a single answer about an ML research topic.

Distribution mismatch: query about a person, evidence from a transaction

Query: “What did Athar send about the cleaning and how much were they paid?” Evidence: “Athar B. | Cleaning | ‘Hello please can you confirm?’” Answer: “On Monday morning, July 28, 2025, Athar B. messaged you via Taskrabbit regarding your cleaning task…” Apps: Gmail | Tree depth: 0

The query mentions a person and a service; the evidence is a Taskrabbit receipt email. Low semantic similarity between query and source — precisely the kind of association static methods miss.

Deep evolution: negative answer from Slack (depth 27)

Query: “Is there a link to the wafer api subdomain yet?” Evidence: “Andre | set up an api.wafer.systems subdomain eventually | but that can wait” Answer: “No, you don’t have a link to the API subdomain yet. On May 17, 2025, Andre mentioned…” Apps: Slack | Tree depth: 27

The deepest evolved sample. The correct answer is “no, not yet” — evidence of the absence of a thing.

Sample Comparison: Evolutionary vs Static

To illustrate the structural differences between the two data generation methods, here are representative samples drawn from the actual production datasets.

Evolutionary samples

Query: “Where have I been driving the last couple of days and did I pay Vishal?”
Evidence: “Exit 1B US-101 N / Golden Gate Bridge” | “Masonic Ave” | “You paid Vishal Vaddadhi $55.00” | “Parking in front of cocobang”
Apps: Gmail + Messages + Maps | Tree depth: 1

Query: “What are my plans with Vishal and Youngchul for this evening?”
Evidence: “Reminder: upcoming reservation for SAN HO WON” | “Saturday, January 24, 2026, 6:00 PM” | “communal Counter Seating Reservation” | “Leggo” | “We doin the sake night??”
Apps: Gmail + WhatsApp + Messages | Tree depth: 1

Query: “Did I buy the Doraemon collection?”
Evidence: “Your favorite blue robot cat just got cozier — dropping tomorrow!” | “Doraemon slips into Gelato Pique’s cloud-soft embrace” | “limited capsule drops”
Apps: Gmail | Tree depth: 0

Query: “status of project switchboard features”
Evidence: “perf(WAF-362): refactor produce_outlook” | “PR #215” | “CI / test (ubuntu-latest, default, 1.85.0) Failed in 1 minute and 36 seconds”
Apps: Gmail + Slack | Tree depth: 8

Static baseline samples (mnr-v2.1)

Query: “What’s the company address listed in those ParkMobile emails?”
Positive: “As a thank you for being a ParkMobile user, enjoy a $30 gift card on us! …”
Apps: Gmail (single source)

Query: “Why do they ask me to arrive 5 minutes early for the visit?”
Positive: “Please arrive 5 minutes before your visit to allow for check in. / location: One Medical Group, 3001 Palm Way, Suite 134, Austin, TX 78758 / name: Doctors Appointment — John Kasel, PA-C”
Apps: Calendar (single source)

Query: “what kind of event is new year’s eve on my calendar — is it marked as a holiday/observance?”
Positive: “Observance / To hide observances, go to Google Calendar Settings > Holidays in United States / name: New Year’s Eve”
Apps: Calendar (single source)

The structural contrast is clear. The static method produces flat (query, single-document) pairs, each drawn from one application. The evolutionary method produces structured (query, evidence, answer) triples whose evidence is often multi-source and cross-application, paired with terser, more natural queries. At higher tree depths, the evolutionary queries become increasingly colloquial, closer to what users actually type on their phones.
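The two record shapes contrasted above can be sketched as plain data classes. This is a minimal illustration rather than the production schema: the class names `StaticPair` and `EvolvedSample`, every field name, and the choice to split the pipe-separated evidence into separate snippets are assumptions made here. The values are copied from samples quoted earlier, with truncated text left truncated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StaticPair:
    """Hypothetical shape of a static-baseline record: one query, one positive document."""
    query: str
    positive: str
    app: str  # a single source application


@dataclass(frozen=True)
class EvolvedSample:
    """Hypothetical shape of an evolutionary record: query, evidence snippets, answer."""
    query: str
    evidence: tuple[str, ...]  # snippets that may come from different sources
    answer: str
    apps: tuple[str, ...]      # every application that contributed evidence
    tree_depth: int            # how many times the generating strategy was refined

    @property
    def is_cross_application(self) -> bool:
        # True when the evidence spans more than one distinct application.
        return len(set(self.apps)) > 1


static = StaticPair(
    query="What’s the company address listed in those ParkMobile emails?",
    positive="As a thank you for being a ParkMobile user, enjoy a $30 gift card on us! …",
    app="Gmail",
)

evolved = EvolvedSample(
    query="What’s the latest on K2 model training and the new architecture Vik shared?",
    # Assumption: each pipe-separated segment of the quoted evidence is one snippet.
    evidence=(
        "Kimi K2",
        "large language model series developed by Moonshot AI team",
        "Define a set of tool calls and associated schemas",
    ),
    answer="According to your Slack messages from late July, the plan for the Kimi K2 model…",
    apps=("Slack", "Twitter", "Notion"),
    tree_depth=2,
)

print(static.app, "|", evolved.apps, "| cross-application:", evolved.is_cross_application)
```

Keeping `apps` and `tree_depth` on the record makes it easy to slice a dataset by how many applications a sample touches or by how far the generating strategy had evolved, which is how the samples above are grouped.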

1. Reimers & Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP 2019. https://arxiv.org/abs/1908.10084
2. Karpukhin et al. “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. https://arxiv.org/abs/2004.04906
3. Henderson et al. “Efficient Natural Language Response Suggestion for Smart Reply.” 2017. https://arxiv.org/abs/1705.00652
4. Xiong et al. “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.” ICLR 2021.
5. Yang et al. “Qwen3 Technical Report.” 2025. https://arxiv.org/abs/2505.09388
6. Sturua et al. “jina-embeddings-v3: Multilingual Embeddings With Task LoRA.” 2024. https://arxiv.org/abs/2409.10173
7. Nussbaum et al. “Nomic Embed: Training a Reproducible Long Context Text Embedder.” 2024. https://arxiv.org/abs/2402.01613
8. Xiao et al. “C-Pack: Packaged Resources To Advance General Chinese Embedding.” 2023. https://arxiv.org/abs/2309.07597
9. A single phone in our system indexes 57 applications, including Gmail, Maps, Messaging, Calendar, Slack, WhatsApp, Notion, Twitter/X, YouTube, Photos, Uber, Uber Eats, Venmo, Linear, Shazam, United Airlines, and many others. Spotify alone contributes ~163,000 notification-based sources.
10. We use “distribution matching” deliberately. In generative modeling, the term refers to learning p_model ~ p_data. Here, the problem is analogous: the training data creation process is itself a generative sampling problem. We are trying to sample (query, evidence, answer) triples from the joint distribution of useful associations, which is multimodal, has complex latent structure, and cannot be effectively sampled with simple heuristics. The downstream retrieval task is alignment (learning a similarity function), but the data creation problem that this post addresses is genuinely a distribution matching challenge.
11. Stanley & Lehman. “Why Greatness Cannot Be Planned: The Myth of the Objective.” Springer, 2015.
12. Huh et al. “The Platonic Representation Hypothesis.” ICML 2024. https://arxiv.org/abs/2405.07987
13. Wang et al. “Improving Text Embeddings with Large Language Models.” ACL 2024. https://arxiv.org/abs/2401.00368
14. Lee et al. “Gecko: Versatile Text Embeddings Distilled from Large Language Models.” 2024. https://arxiv.org/abs/2403.20327
15. Morris. “How to train the best embedding model in the world.” Token for Token (Substack), March 9, 2026.
16. Lehman et al. “Evolution through Large Models.” 2022. https://arxiv.org/abs/2206.08896
17. Romera-Paredes et al. “Mathematical Discoveries from Program Search with Large Language Models.” Nature 625, 468-475, 2023.
18. Novikov et al. “AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery.” 2025. https://arxiv.org/abs/2506.13131
19. Lange et al. “ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution.” 2025. https://arxiv.org/abs/2509.19349
20. Zeng et al. “Automatic Instruction Evolving for Large Language Models.” EMNLP 2024. https://arxiv.org/abs/2406.00770
21. Bai et al. “Constitutional AI: Harmlessness from AI Feedback.” 2022. https://arxiv.org/abs/2212.08073
22. Technically, this is evolution with an indirect genotype-phenotype mapping rather than co-evolution in the strict sense [Hillis, 1990]: there is one evolving population (genomes/policies) whose fitness is evaluated through their phenotypic expression (generated data), not two co-evolving populations. The data generation policy evolves; the data are emergent artifacts.
23. Samples are drawn from the production evolutionary run state file. Queries, evidence, and answers are reproduced verbatim.