Learning from Highly Related Examples
Chris Gresla
2025-02-07
A study in using a small amount of in-domain data to drastically improve action model performance.
Abstract
As operating systems and the applications on them evolve, action-taking models that were once successful at completing tasks on these systems become outdated and obsolete as the paths to completing tasks change. Additionally, the types of tasks action models need to complete expand and diversify with the needs of users, motivating the desire for models to generalize to novel tasks and settings. In this work we rigorously explore several methods for improving action model performance in new task settings involving goal-conditioned manipulation of Android devices. Our methods integrate data that is highly related to new tasks. We find that the best approach for adapting models to new task settings is lightweight training, and we observe that models trained for zero-shot action-taking perform better than few-shot prompted models, even in scenarios where models are specifically developed to integrate information from few-shots.
Background
At Wafer, we are using our OS-level access to build a comprehensive picture of the user and to understand the things they use their phones for every day. With a deep understanding of our users, we can suggest personalized actions based on the state of their phones that significantly improve the experience of using mobile devices. To accomplish this, we need to build models that are capable of using a phone in the same way a human would. While a number of static datasets provide a playground for developing action models,1 2 there is a considerable domain shift between the tasks encapsulated in these datasets and those in demand by users in the real world. This necessitates systems that are continuously updated or that learn online.3 The real world is constantly evolving; applications have weekly updates, users’ workflows change, and the data that was once the “ground truth” becomes stale.
An early observation from our initial experiments training action models was that static UI navigation datasets should be treated more like a pretraining set or supervised-finetuning dataset for creating models that are simply capable of taking actions and understanding screen representations. This initial phase of training imbues models with an internal representation of UIs and an understanding of how the various elements on screen pertain to the completion of a specific goal. However, this phase of training alone does not result in models that generalize to many novel task settings – we found that models trained in this manner tended to fail when exposed to new applications and types of tasks that were not present in the initial training set. To bridge the gap between these tuned action models and the tasks that need to be accomplished on user devices, we need a way to adapt the base action models to the specific task distributions.
Generally, goal-conditioned episodic interaction data for unique workflows that occur on individuals’ phones is not widely available. The closest publicly available datasets are the aforementioned static datasets, but these are snapshots in time that represent a limited set of all useful user-device interactions. At Wafer, we are building at the OS level, and therefore are in a unique position to autonomously capture episode-like data in an online setting, as the habits, trends, and needs of a user change. This position and the usage data we can acquire provide us with a holistic view of our users, and they permit us an opportunity to continuously adapt our models to the world as it changes.
Problem of Interest. We explore several approaches to determine the most effective method for integrating highly-related examples for action generation. Highly-related task data can be leveraged by an action-taking model in multiple ways. Few-shot prompting4 a model with samples containing similar inputs and desired outputs has been shown to improve model performance in some settings without requiring additional model training. Finetuning a pretrained model, on the other hand, is the “canonical” method for improving task-specific performance,5 but for on-device action-taking models, integrating highly-related information through continued training introduces a plethora of complex engineering challenges.
To rigorously assess alternate approaches for leveraging highly-related data, we experiment with four prompting settings and with two methods for adapting the weights of action models we previously trained6 to improve action-taking performance in Android OS environments.
Method
Here we consider episodes of goal-conditioned device interactions, similar to those collected in AndroidControl1 and utilized in our prior work.6 The episodes we consider are comprised of a sequence of steps and an overarching goal $g$. The goal describes, in natural language, the high-level objective to be accomplished in the episode. The $t$-th step in each episode is composed of a screenshot of the device’s screen $x_t$, a textual description of the prior actions taken in the episode $h_t$, and a ground-truth action $a_t$ that should be generated given the context $c_t = (g, x_t, h_t)$.
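For concreteness, the episode structure described above can be sketched as a small data container; the field names below are illustrative assumptions rather than our actual data schema.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """A single step t within an episode."""
    screenshot: bytes  # x_t: image of the device's screen at this step
    history: str       # h_t: textual description of the prior actions in the episode
    action: dict       # a_t: ground-truth action, e.g. {"name": "click", "parameters": {...}}

@dataclass
class Episode:
    """A goal-conditioned sequence of device-interaction steps."""
    goal: str          # g: natural-language objective for the whole episode
    steps: list[Step]  # the steps that accomplish the goal
```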
In this study we consider the effect of using “highly-related examples” for improving the test-time performance of action-taking models. For the test episodes we are working with, a highly-related example is a paired episode for the test episode in question. This paired episode contains a similar high-level goal (in terms of syntax and semantics), the same number of steps, and relatively similar screenshots and actions for each step. Such a related example is something we can annotate from user interactions with phones, and for highly-frequent workflows, this paired example could be used to improve our action models.
Prompting Setups. We test our models in zero-shot and several few-shot settings. In the zero-shot setting the prompt for every step in each episode consists of:
- The goal (expressed in natural language)
- The current screen state (a screenshot of the device)
- The history of previous actions (a string of the actions taken in the episode’s prior steps)
We investigate three settings where we prepend a test episode’s prompt with few-shots according to the following schemes (a minimal prompt-assembly sketch follows the list):
- **1-Shot “Gold Paired Completions” (GPC/s)**
  - We include one example from the highly-related “reference” episode, with the same step index as the test episode
  - The reference example contains: goal, state, history, and the correct action
  - This provides an ideal and related demonstration to guide the model in its current context
- **3-Shot “Gold Paired Trajectory Steps” (GPT/s)**
  - We include sub-trajectories from the reference episode as context
  - This provides the model with a “meta-pattern for similar episodes”
  - The prompt includes three contexts and their corresponding actions
  - These examples show the model how to handle similar situations
- **N-Shots of “Best Effort Pair Completions” (BEPC/s)**
  - This setting uses similar but not highly-related samples
  - We test performance with N examples (where N = 1, 2, 3, 4, 5, or 7)
  - Examples are selected on a “best-effort basis”, using information that is available at deployment time
  - This approach is more realistic than using “oracle” samples
  - It helps us understand performance in practical scenarios
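To make these schemes concrete, the sketch below assembles a single step’s prompt and optionally prepends demonstration shots. The chat-message structure and dictionary keys here are illustrative assumptions, not our production prompt format.

```python
def step_prompt(goal: str, screenshot: bytes, history: str) -> list[dict]:
    """Zero-shot prompt for one step: the goal, the current screen, and the action history."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Goal: {goal}\nPrevious actions: {history}"},
            {"type": "image", "image": screenshot},
        ],
    }]

def few_shot_prompt(test_step: dict, shots: list[dict]) -> list[dict]:
    """Prepend (context, action) demonstrations ahead of the test step's prompt.

    `shots` may hold the paired GPC step, GPT sub-trajectory steps, or BEPC picks;
    each shot is a dict with "goal", "screenshot", "history", and "action" keys.
    """
    messages = []
    for shot in shots:
        messages += step_prompt(shot["goal"], shot["screenshot"], shot["history"])
        messages.append({
            "role": "assistant",
            "content": [{"type": "text", "text": str(shot["action"])}],
        })
    messages += step_prompt(test_step["goal"], test_step["screenshot"], test_step["history"])
    return messages
```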
For the interested reader, we provide more formal definitions of these few-shot methods in the appendix.
Developing Models with Highly-Related Data. As part of this work we investigate settings where we update the parameters of action models with highly-related data. Using the multimodal action models we trained previously,6 we examine the effects of two training procedures.
Firstly, we assess the effects of continued supervised fine-tuning of a model’s weights on highly-related task data. This setting can be seen as a continuation of the training process we applied to create the base action model and it mimics what would occur if an action model was trained in a fully online setting. We finetune an existing action model for the zero-shot setting with the cross-entropy loss, applied only on tokens corresponding to the action in a given sample.
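A minimal sketch of this action-only loss, assuming a standard causal language-model interface where non-action tokens are excluded by setting their labels to the usual ignore index:

```python
import torch
import torch.nn.functional as F

def action_only_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                     action_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over next-token predictions, restricted to the action tokens.

    logits:      (batch, seq_len, vocab) model outputs
    input_ids:   (batch, seq_len) tokenized prompt followed by the action
    action_mask: (batch, seq_len) bool, True only where a token belongs to the action
    """
    labels = input_ids.clone()
    labels[~action_mask] = -100                    # ignore goal, screen, and history tokens
    shift_logits = logits[:, :-1, :].contiguous()  # position t predicts token t + 1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```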
The second procedure we apply is that of adapting our base action model (which was trained in the zero-shot setting) to integrate information from few-shots. We train a model on samples where zero, one, or two few-shots are prepended to the prompt for a sample as additional context. When training this model we only compute the loss over the final action with the standard cross-entropy.
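When building the adaptation training set, the per-sample shot count can be drawn at random; a minimal sketch, where the uniform sampling is an assumption about the mixing scheme:

```python
import random

def sample_adaptation_shots(candidate_shots: list[dict], max_shots: int = 2) -> list[dict]:
    """Draw the few-shot context for one adaptation training sample.

    The shot count is sampled uniformly from {0, 1, ..., max_shots}, so the model
    sees zero-, one-, and two-shot prompts during adaptation training; the loss is
    still computed only on the final action.
    """
    n_shots = random.randint(0, max_shots)
    return random.sample(candidate_shots, k=min(n_shots, len(candidate_shots)))
```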
Experiment Setting
To create a suitable testing ground, we reverse engineered the data collection process used in 1 and then manually collected 30 episodes of device interactions in the same format as the instruction tuning dataset we used to train our base action model. These 30 episodes consist of 15 unique “types of tasks”, where a given task type contains a goal that can be accomplished by using a specific app or series of apps. As an example, consider the following episode:
Figure 1: A sample episode pulled from the test set; each step is depicted as a screenshot with the relevant action information below.
The completion of this goal/task would involve navigating from the initial starting screen to the Waymo application, and then through the in-application workflow of booking a ride to the desired location “Balboa Cafe”. For each of the 15 task types, we collected two “paired” episodes, each with a different goal. Each pair of episodes contains the same number of steps and, at each step, rather similar screenshots and actions. We employ the steps of one of the paired episodes as our “test” samples and the other episode’s steps as our highly-related examples.
In total, the 30 collected episodes contain 302 individual steps. Half of these episodes and steps comprise our test set and the remaining 15 episodes/151 steps are reserved as references for the techniques we study. For the GPC and GPT few-shot settings, we include shots from paired episodes as specified in the Method section. For the BEPC setting, we aim to understand what performance differences we may observe when we use samples that are only somewhat related to a given test sample. To do this, we select samples from the training set of AndroidControl1 and from the set of reference episodes, excluding the oracle step and all future steps from the reference conversation. The selection is conducted by comparing the prompt of the test sample with samples from this combined set, using only information we would have when the model is deployed. Specifically, we selected samples based on the index of the current step in a respective conversation, the currently active application on the device, the sequence of prior function calls in a conversation, and the token overlap between episode goal strings.
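The scoring heuristic below sketches this best-effort selection; the particular feature weights and the Jaccard token-overlap measure are illustrative assumptions rather than our exact implementation.

```python
def best_effort_score(test: dict, candidate: dict) -> float:
    """Score how related a candidate shot is to a test sample, using only
    deployment-time information: step index, active app, prior function calls,
    and token overlap between goal strings."""
    score = 0.0
    if candidate["step_index"] == test["step_index"]:
        score += 1.0
    if candidate["active_app"] == test["active_app"]:
        score += 1.0
    # Fraction of prior function-call names that agree, position by position
    prior_t, prior_c = test["prior_calls"], candidate["prior_calls"]
    score += sum(a == b for a, b in zip(prior_t, prior_c)) / max(len(prior_t), 1)
    # Token overlap between goal strings (Jaccard similarity)
    tok_t = set(test["goal"].lower().split())
    tok_c = set(candidate["goal"].lower().split())
    score += len(tok_t & tok_c) / max(len(tok_t | tok_c), 1)
    return score

def select_bepc_shots(test: dict, pool: list[dict], n: int) -> list[dict]:
    """Pick the n best-effort shots from the candidate pool."""
    return sorted(pool, key=lambda c: best_effort_score(test, c), reverse=True)[:n]
```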
We conducted two training jobs: one for the zero-shot continued-finetuning model and one for the few-shot-adapted model. For both models we use the same base checkpoint from our prior work7 and apply their respective training recipes. For the model we train on reference episodes in the zero-shot setting (`wafer-re-zft`), we used the 151 steps from the reference episodes as our training set. For the model we adapt to the few-shot setting (`wafer-fs-adapt`), we continued training on samples from AndroidControl, where the number of few-shots per training sample and the shots themselves were randomly sampled. We trained both models for 1,000 optimization steps with an effective batch size of 16, a learning rate of `1e-5`, a single-cycle cosine learning rate schedule, a warmup step proportion of 20%, and FSDP.
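The optimizer and schedule implied by these hyperparameters can be sketched as follows; the choice of AdamW and the use of the `transformers` cosine-schedule helper are assumptions about the training setup.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

TOTAL_STEPS = 1_000
WARMUP_STEPS = int(0.20 * TOTAL_STEPS)  # 20% warmup proportion
LEARNING_RATE = 1e-5
EFFECTIVE_BATCH_SIZE = 16               # per-device batch x grad accumulation x data parallel

def build_optimizer_and_schedule(model: torch.nn.Module):
    """Single-cycle cosine schedule with linear warmup over the full run."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=WARMUP_STEPS,
        num_training_steps=TOTAL_STEPS,
    )
    return optimizer, scheduler
```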
Our evaluations are conducted on the best model from our prior multimodal action model training (`wafer-base`),8 6 the two models trained here (`wafer-re-zft` and `wafer-fs-adapt`), and the non-action-tuned base model9 (`Idefics3`). In all evaluations, we use greedy-decoded model completions and employ a “relaxed stepwise accuracy” metric akin to AndroidControl1 and our reference work.6 This metric compares a model-generated function call with a ground-truth function call; if the chosen function (name) and the arguments to it (the parameters provided to the function) align, we count a model prediction as correct. The “relaxation” allows non-exact but effectively equivalent generated function calls to pass. Here (as in our prior work), we permit two types of relaxed function calls, sketched in code after the list:
- We permit `click` predictions whose coordinate values are within 3% of the total screen resolution of the ground-truth coordinates.
- We also pass instances of `input_text` where the model-predicted text value is within a Levenshtein distance of 3 of the ground-truth text. We lowercase both the prediction and ground-truth strings before this comparison.
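Below is a minimal sketch of this relaxed comparison, assuming function calls are the JSON-style dicts shown later in this post; the parameter names (`x`, `y`, `text`) are assumptions about the action schema.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def relaxed_match(pred: dict, gold: dict, screen_w: int, screen_h: int) -> bool:
    """Relaxed stepwise comparison of a predicted and a ground-truth function call."""
    if pred.get("name") != gold.get("name"):
        return False
    p, g = pred.get("parameters", {}), gold.get("parameters", {})
    if pred["name"] == "click":
        # Coordinates within 3% of the screen resolution count as correct
        return (abs(p["x"] - g["x"]) <= 0.03 * screen_w
                and abs(p["y"] - g["y"]) <= 0.03 * screen_h)
    if pred["name"] == "input_text":
        # Lowercased text within Levenshtein distance 3 counts as correct
        return levenshtein(p["text"].lower(), g["text"].lower()) <= 3
    return p == g  # all other functions require exact parameter agreement
```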
After evaluating all of the models, we found that while the relaxations introduced above capture a number of the false negatives, there are a large number of coordinate-based function predictions (i.e., `click` and `long_press`) that are marked as incorrect when in fact they are semantically equivalent. Following these observations, we manually graded a large proportion of all of the evaluation results. We found that the “true” scores for models (determined through manual grading) are generally (one standard deviation) greater than the unadjusted (raw) stepwise accuracies. Below we report the evaluation results and include the unadjusted “relaxed stepwise accuracy” along with the manually adjusted score following human grading.
Results
Following the completion of our two training jobs and the implementation of the various evals, we obtained the following greedy-decoded scores for the models and prompting settings of interest:
Figure 2: Evaluation results in each of the considered prompting settings; observations in bold indicate the best performance for a specific setting.
Some observations on these scores in relation to the tested methods:
- Notably, the base model, `Idefics3`, did quite well relative to the other models in the one-shot GPC setting, with the best adjusted stepwise score of 83.4%. From our manual grading we note that a large proportion of the actual failures in this specific evaluation are due to incorrect clicks or the base model generating syntactically incorrect function calls; this did not carry over to textual or no-parameter functions, on which the base model was quite successful.
- In two of the considered prompting scenarios, the lightly tuned `wafer-re-zft` outperformed the other models. This model was created with the zero-shot setting in mind, which is reflected in its score of 87.4% (the best score observed overall), and that performance carried over to the three-shot GPTs setting.
- The model adapted to the few-shot setting, `wafer-fs-adapt`, did in fact score highest in the BEPC setting. The adaptation method does improve performance for varied few-shot settings at test time, as evidenced by the lift in scores over the `wafer-base` model; however, we note that this model underperformed in the three-shot GPTs scenario.
- Without any shots, the base model was incapable of generating any valid function calls, and in the few-shot settings its performance degrades significantly as the quality of the few-shots degrades.
Additionally, we looked at the effect of increasing the number of few-shots, a technique that can improve few-shot performance. In the following table we present statistics of the unadjusted stepwise accuracies for the BEPC evaluations:
Figure 3: Statistics for the results of BEPC evaluations; the worst values per column are colored in red and the best in green.
Here we observe that increasing the number of shots in a context doesn’t lead to better performance on average; rather, the best performances are made better and the worst scores worse. This is evident in the increase in performance variance: the minimum score, the maximum score, and the largest standard deviation all occur in the 7-shot case, where we have the most shots included in context. The standard deviation reliably increases as a function of the number of shots.
A Few Sample Episodes
Below we visualize a few episodes from our test set. The predictions here are from `wafer-re-zft`, and for each step in the episode we provide the model’s predicted action along with the ground-truth action. For coordinate-based actions, we overlay the model predictions in red and the ground-truth coordinates in blue.

{"name": "open_app", "parameters": {"app_name": "Spotify"}}
{"name": "open_app", "parameters": {"app_name": "Spotify"}}
Qualitatively, we made a number of interesting observations during our manual grading of results. We include a number of these in the appendix and summarize the findings below:
- **The base model resorts to chatting**: we note that `Idefics3` seems to degenerate to non-action generations. This is particularly apparent in the BEPC setting, where we observe several samples (~1 in 3) for which the model goes off the rails and starts “chatting” with the user or generating something random (like a Python program), whereas in the GPC results we only observe this specific kind of failure to follow the task instructions/format in a single instance (of 151). This suggests that more invalid and chatty completions are generated as the quality of the few-shots degrades. In the zero-shot scenario, the base model fails to get a single function call correct (despite a tailored prompt) and instead tends to summarize the provided screenshots or elements of the UI.
- **A bias in trained models towards common click locations**: across all of our finetuned models, we note that the most common failed `click` prediction is a `click` with coordinates in the lower-right-hand side of the screen when the keyboard is visible. Samples with `click` coordinate values in this region are present in the base AndroidControl dataset, and that bias carries over to most of our models here; notably, the best `wafer-re-zft` checkpoint is one of the only models to overcome this predisposition.
- **Few-shots can distract**: across all models we notice that models tend to fail by copy/pasting data from few-shot examples when they are present. This is most apparent in the GPT and BEPC evaluations, where multiple samples are in context at once. We notice that models tend to copy actions verbatim from few-shots or generate actions which relate more to the context of a few-shot example than to the test sample’s context (which is the latest sample in the prompt).
- **Models can fail by trying alternate solutions**: an intriguing failure case we observed in many examples is that of a model attempting to “shortcut” its way to the completion of an episode. For several examples involving the booking of a dinner reservation, when the model is presented with the default screen of an application, it is able to view the desired restaurant for the booking. The ground-truth action is to click on the search bar and type out the restaurant name; however, the model (incorrectly) attempts to directly `click` on the entry for the restaurant. We primarily observe this behavior in the `wafer-re-zft` model and in samples where the data presented by a screenshot suggests a more direct way to reach a desired goal state (like an artist being available on a Spotify homepage, or selecting a recent search result in lieu of retyping the search).
Conclusion
We studied several methods for learning from highly-related examples and evaluated them with various action-taking models. For the set of test tasks we gathered, we found that the best way to use related information is to update the weights of a model with a small amount of training, and that these lightly tuned action models outperform counterparts trained specifically for the few-shot setting.
From our experiments we also learned that few-shot prompting with “ideal” few-shot samples can approach the performance of training a model, but the performance gained from including few-shot prompts degrades drastically when the quality or relevance of the few-shot samples decreases. Training models specifically for the few-shot setting improves them, but not to as significant a degree as continued zero-shot training. Furthermore, including few-shot samples, at the scale of models considered, can have detrimental effects on model predictions: the extra information added to the context window can improve models but may also confuse them. Increasing the number of few-shots in a prompt increases the variance of the scores without significantly improving performance.
Ultimately, the zero-shot performance of the `wafer-re-zft` model is most compelling. The zero-shot setting requires substantially fewer tokens per request at test time and avoids some of the “distracting” effects of few-shot prompting. It also displays intriguing generalization capabilities and overcomes some of the biases observed in models we trained on AndroidControl. This technique comes with a glaring, non-trivial challenge: that of on-device finetuning. We have conducted a few preliminary tests and find that we don’t need much optimization to adapt models to the level of performance reached in this post, but effective on-device model training is something we are iterating towards, in pursuit of enabling on-device action models that learn from live demonstrations and online interaction. If that prospect sounds interesting or if you have a perspective that you would like to share, we would love to hear from you.
Thank you to Abhay Kashyap, Nicole Fitzerald, Sudhanshu Ranjan, Daniel Bulhosa Solórzano, and Nate Harada for invaluable feedback on preliminary drafts of this post.
Appendix
Formalized Prompting Setups
We test our models in the zero-shot and in several few-shot settings. In the zero-shot case, for the $i$-th episode and the $t$-th step, a model’s prompt would be composed of the step’s context $c_{i,t} = (g_i, x_{i,t}, h_{i,t})$, consisting of the goal, the current screenshot, and the action history, from which the model must generate the ground-truth action $a_{i,t}$.
We consider the following few-shot settings for leveraging highly-related data to improve action model performance:
- **1-Shot “Gold Paired Completions” (GPC/s)**: for the $t$-th step in test episode $i$, we include the step-$t$ sample from the highly-related “reference” episode $i'$ as a single in-context example in the prompt. The inclusion of a paired step’s context and completion provides an ideal demonstration to the model that guides it in its current task context.
  - The reference step is simply prepended to the model’s prompt, resulting in a prompt containing $(c_{i',t}, a_{i',t}, c_{i,t})$, where the model would need to generate the corresponding $a_{i,t}$.
- **3-Shot “Gold Paired Trajectory Steps” (GPT/s)**: when we increase the number of shots included in the model’s prompt, we can include sub-trajectories of the reference episode as context. Here we provide models with a 3-shot prompt at test time, which provides the model with a “meta-pattern for similar episodes”:
  - The prompt here is composed of three contexts and their corresponding actions, denoting step $t$ in the reference episode $i'$ as $(c_{i',t}, a_{i',t})$
  - We then have the $t$-th 3-shot GPT prompt for episode $i$ with paired episode $i'$ denoted as $(c_{i',t_1}, a_{i',t_1}, c_{i',t_2}, a_{i',t_2}, c_{i',t_3}, a_{i',t_3}, c_{i,t})$, where $t_1, t_2, t_3$ index the three steps of the reference sub-trajectory
- **N-Shots of “Best Effort Pair Completions” (BEPC/s)**: in few-shot scenarios 1 and 2, we measure the effect on action model accuracy when using “oracle” samples. These settings provide an “upper bound” on the performance we would likely expect to observe in practice, as they are indicative of the best possible case/paired episodes. Therefore we also measure the performance of the model with similar, but not highly-related, samples. Specifically, we test model performance with the inclusion of $N$ few-shot samples, which are selected on a “best-effort basis”.
  - Formally, the prompt for an $N$-shot best-effort sample would be $(c_{(1)}, a_{(1)}, \dots, c_{(N)}, a_{(N)}, c_{i,t})$, where each $(c_{(n)}, a_{(n)})$ is a context-action pair selected from the best-effort pool.
Sample Completions from Various Models
Sample 1: Here the goal was to “Play Drake on Spotify”; the model attempted to click on Drake directly instead of searching for the artist.
Sample 2: In this episode the model needed to book a Waymo ride; the model tried to open Waymo via a `click` instead of an `open_app` call, recognizing that clicking the app on screen was an equivalent action.
Sample 3: This is a screen misunderstanding. The prior action added the entry (“North Ocean Beach”) to the user’s watchlist (reflected in the check mark on the right-hand side of the UI); the model failed to understand this icon in relation to the prior actions and attempted to click on the entry again, while the ground-truth action here is to mark the episode as complete.
Sample 4: Here the model clicked on TikTok incorrectly, where the ground truth (in blue) is to install the Facebook app. A few-shot that was included in the context of this sample mentions installing TikTok, which shows how the inclusion of a few-shot sample can distract a model from its test task.
N-Shot BEPC Evaluation Scores
Note: we did not conduct manual evaluations for all models and settings of “Best Effort Pair Completions”; as such, we report just the stepwise accuracy for those that were not manually graded.
Evaluation results with differing numbers of few-shots, highlighting variations in performance.
References
Footnotes
- AndroidControl - https://doi.org/10.48550/arXiv.2406.03679 ↩ ↩2 ↩3 ↩4 ↩5 ↩6
- Android in the Wild - https://doi.org/10.48550/arXiv.2307.10088 ↩
- Language Models are Few-Shot Learners - https://arxiv.org/abs/2005.14165 ↩
- Universal Language Model Fine-tuning for Text Classification - https://arxiv.org/abs/1801.06146 ↩
- UI Trees or Pixels? Studying Modality Choices in Android Action-Taking Models - https://wafer.systems/utop ↩ ↩2 ↩3 ↩4 ↩5
- The models we trained in this work were based off a checkpoint from our prior job, from about halfway through that full training process (~15k optimization steps). Note that this is not the same checkpoint as `wafer-base`, which was the best performing model from that work. ↩
- The best checkpoint from this job was trained to 28k optimization steps. ↩