UI Trees or Pixels? Studying Modality Choices in Android Action-Taking Models

Chris Gresla

2025-02-06

How should a model that operates the Android OS be architected?


Abstract

We study how the choice of screen representation modality impacts the performance of action-taking models on the Android OS. Several recent works probe the design space of these action-taking models, 1 2 3 4 5 but the key architectural question of which screen representation modality should serve as the base for device manipulation model perception remains underexplored. We study systems based on textual and visual modalities and find that visual models built to perceive their environment through device screenshots provide a compelling architecture for action models. These multimodal models perform on par with text-only action models, but they learn richer relations between spatial actions and screens, require substantially fewer tokens to process the same information, continue to learn while text-only approaches overfit, and present a general base for other applications that require processing visual inputs. Our findings inform the design of future action-taking models, suggesting that integrating information across modalities is paramount to effective intelligence.

Introduction

The proliferation of Large Language Models (LLMs) and Visual Language Models (VLMs) has paved the way for new paradigms of interacting with technology. LLMs/VLMs contain broadly applicable representational knowledge; these representations are key components in prompt-based agentic systems 6 7 8 9 10 11 and are crucial starting checkpoints for training models for various applications. Device manipulation is emerging as a significant downstream application of LLMs. These systems are capable of interacting with digital devices, such as computers and mobile phones, through the same inputs that humans provide. Mobile operating systems, such as Android,11 provide a unique setting for language models to interact with and assist humans. At the system level, an operating system holds a cornucopia of data on a user: device interaction traces, user preferences, and highly informative contexts. Responsible and effective use of this information could allow LLMs to perform highly personalized tasks on behalf of users, drastically improving the experience of using mobile devices and driving a new paradigm of human-AI interaction.

To date, action-taking model architectures have primarily been developed with multimodal representations. These models perceive their environment through screenshots of the devices they are manipulating. On Android in particular, a second option is available for representing screen states: accessibility trees (UI Trees), which encode screen information purely in text. 12 When developing on-device action models, it is not clear which representation provides a better foundation for such an agentic system. Therefore, in this study we investigate the key question:

“What representational modality provides the best base for action-taking models?”

Addressing this question is essential, as on-device action-taking models need to comprehensively understand the semantics of screen states in order to operate reliably while deployed in heavily constrained computational settings.

We explore this question by training a pair of models suitable for on-device Android action-taking applications and comparing the characteristics of models trained on textual and visual screen representations. We conduct our experiments with the AndroidControl dataset13 and use a unified, coordinate-based action space in our primary experiments. We compare the models on their ability to generate suitable actions in pursuit of task goals, as well as on the architectural characteristics of each method. Additionally, we examine the training dynamics of our models, finding insights into how the choice of representation affects convergence. In sum, this study characterizes the models that result from each choice of screen representation modality.

Preliminaries

Problem setting. The development of action-taking models for the Android Operating System (OS) requires answering which screen representation modality is the most effective. Several action-taking models have been developed that use either textual representations, such as UI Trees, or visual representations, such as screenshots, to interpret device states in action-taking trajectories. However, no prior work trains on both representations and compares their effects directly against one another. As such, it remains unclear which modality provides a more robust foundation for developing action models. We investigate this problem in our study and address the crucial architectural question which underpins the design of capable action-taking models.

The Dataset and Model Training. To enable our study, we require (1) goal-conditioned device manipulation data with screen representations for both modalities of interest and (2) comparable base models suited for training action models that perceive the environment through each modality.

To fulfill the first requirement, we leverage the AndroidControl dataset, introduced by Li, Wei, et al.13 This dataset contains goal-conditioned episodes of user-device interactions collected by human annotators. Each episode starts with an initial screen state and a goal that needs to be accomplished; the goal is then completed over a sequence of steps. For each step in an episode, the state of the device is represented either with an Android accessibility tree (a textual representation of the screen) or with a screenshot (a rasterized image). Additionally, an action is applied to the device which brings the episode closer to completion and advances it to the next step. The dataset contains both modalities of screen representation for every step of every episode, providing ideal pairings of textual and visual screen states for our experiment. The full AndroidControl dataset contains 15,283 episodes and a total of 99,131 individual steps. We provide a sample from an episode with all of the attributes that we use in the appendix.
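
To make the pairing concrete, a step-level record can be pictured roughly as follows. This is a minimal sketch with illustrative field names, not the dataset's exact schema.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of an episode; field names are illustrative, not the dataset schema."""
    screenshot_png: bytes   # rasterized screen state at this step
    ui_tree: str            # accessibility-tree text for the same screen state
    action: str             # ground-truth action applied at this step

@dataclass
class Episode:
    goal: str               # high-level goal written by the annotator
    steps: list[Step]       # ordered steps that carry out the goal
```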

For specific models to train on the textual and visual modalities, we choose Llama-3.1-8B-Instruct14 and Idefics3-8B-Llama315, respectively. These two models are derivatives of the widely used Llama 3 architecture, with Idefics3 being a variant of Llama-3.1-8B-Instruct adapted to the multimodal setting through the integration of the SigLIP vision encoder.16 Neither model is trained specifically for goal-conditioned device manipulation tasks, and preliminary evaluations we conducted showed that the two models performed similarly on their respective modalities without task-specific training.

Our datasets are derived by processing the AndroidControl dataset,13 tailored to the two distinct models. The models' task is to predict the appropriate function call for a given step within an episode. Each step-level sample comprises a screen representation $s$ (either a screenshot or a UI Tree), the sequence of prior actions within the episode $z$, a high-level goal $g$, and the ground-truth action $a$.

arch-diagram

Model Inputs per Modality: each sample in our dataset provides a model with the episode's goal, the prior actions already taken in the episode, and a modality-specific screen representation for that step.

For the text modality, the screen representation $s$ is constructed by concatenating the language model's token embeddings of the tokenized UI Tree. If the UI Tree comprises $t$ tokens, the screen representation is $s = [e_1, e_2, \cdots, e_t] \in \mathbb{R}^{t \times d_{text}}$, where $d_{text}$ denotes the dimensionality of the language model's hidden size.

In the visual modality, the representation $s$ consists of a sequence of vision encoder states projected into the language model's embedding space. Given a device screenshot $x_{image} \in \mathbb{R}^{H \times W}$, the multimodal model's vision encoder $f_v$ processes the image to yield hidden states $x_{features} = f_v(x_{image}) \in \mathbb{R}^{m \times d_v}$, where $d_v$ is the vision encoder's hidden dimension. These features are projected into the language model's embedding space using the perceiver resampler $f_p$,15 resulting in the representation $s = f_p(x_{features}) \in \mathbb{R}^{m \times d_{text}}$. It is important to note that the number of tokens $m$ required to represent an image in the input sequence may differ from the number of tokens $t$ needed to represent the corresponding UI Tree.
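
As a rough sketch of this pathway (the module names below are placeholders rather than Idefics3's actual classes), the composition $s = f_p(f_v(x_{image}))$ looks like:

```python
import torch

def encode_screenshot(x_image: torch.Tensor,
                      vision_encoder: torch.nn.Module,
                      connector: torch.nn.Module) -> torch.Tensor:
    """Sketch of s = f_p(f_v(x_image)); shapes follow the notation above.

    vision_encoder plays the role of f_v and connector the role of f_p;
    both are placeholder modules, not the library's real class names.
    """
    x_features = vision_encoder(x_image)   # (m, d_v) vision hidden states
    s = connector(x_features)              # (m, d_text) language-space tokens
    return s
```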

For each step across all episodes, a prompt $p = (s, z, g) = (e_1, e_2, \cdots, e_n)$ of $n$ token embeddings is constructed to encapsulate the screen's state and the episode's context. The model is then supervised to generate the tokens for the ground-truth action $a = (a_1, a_2, \cdots, a_T)$ conditioned on $p$ by employing the standard cross-entropy loss.17 Specifically, we optimize the model parameters $\theta$ by minimizing the loss function:

$$L(a, p, \theta) = - \sum_{t=1}^{T} \sum_{i=1}^{C} y_{t,i} \log\left( p(a_{t,i} \mid a_{<t}, s, z, g; \theta) \right)$$

where $T$ is the length of the token sequence representing the action $a$, $C$ is the language model's vocabulary size, $y_{t,i}$ indicates the correct token at the $t$-th sequence position, and $p(a_{t,i} \mid a_{<t}, s, z, g; \theta)$ is the predicted probability of token $a_{t,i}$ at position $t$ under model parameters $\theta$.
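
A minimal sketch of this objective, assuming the common convention of masking prompt positions with -100 so that only the action tokens are supervised:

```python
import torch
import torch.nn.functional as F

def action_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                prompt_len: int) -> torch.Tensor:
    """Cross-entropy over the action tokens only.

    logits:    (seq_len, vocab) next-token predictions from the model
    input_ids: (seq_len,) prompt tokens followed by action tokens
    Prompt positions are masked out, so only the T action tokens contribute,
    matching the objective above.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = -100                       # ignore prompt positions
    # shift so that position t predicts token t+1
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```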

Model Training Details. In this work, we finetune two models on textual and visual screen representations. Our text-only model is based on Llama-3.1-8B-Instruct14 (text-model) and is trained on UI Trees. Our visual model is based on Idefics3-8B-Llama315 (visual-model) and is trained on screenshots. In both cases we use LoRA18 to finetune the models: we adapt all linear layers (including the vision tower for the visual-model) with a rank of 16. We conducted independent hyperparameter sweeps for both models, searching over learning rate, weight decay, warmup period, learning rate scheduler, LoRA alpha, and LoRA dropout. The final training jobs for both models use the best hyperparameters observed in the respective sweeps; we trained both models for 30,000 optimization steps (~6.5 epochs) at an effective batch size of 16.
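
For illustration, a rank-16, all-linear LoRA setup along these lines could be configured with Hugging Face peft as follows. The alpha and dropout values shown are placeholders, since the post does not report the sweep-selected values.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Rank-16 adapters on all linear layers, as described above. The visual-model
# is wrapped analogously, additionally targeting the vision tower's layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,          # placeholder; tuned via sweep in the original work
    lora_dropout=0.05,      # placeholder
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```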

Experimental Setup

The goal of our study is to determine which screen representation modality serves as a better base for building capable on-device action models. Specifically, our primary interest is to characterize the difference in action-taking performance between models developed for the two representations.

Dataset and Action Space. In light of computational restrictions early in this project, we use a subset of the AndroidControl dataset,13 filtering out samples with sequence lengths greater than 20,000 tokens. After filtering, we hold out the steps from 1,451 episodes as our test set and use the steps from the remaining 13,053 episodes as our training set for both models.
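
One plausible implementation of this filtering and episode-level hold-out is sketched below; `episodes`, `tokenizer`, and `prompt` are illustrative names, and the assumption that both modalities are filtered by the UI-Tree prompt length is ours.

```python
import random

def filter_and_split(episodes, tokenizer, max_tokens=20_000, n_test=1_451, seed=0):
    """Drop over-long steps, then hold out whole episodes for the test set.

    `episodes` and `tokenizer` stand in for the processed AndroidControl
    episodes and the text-model's tokenizer; this is a sketch, not the
    exact split procedure used in the study.
    """
    for ep in episodes:
        ep.steps = [s for s in ep.steps
                    if len(tokenizer(s.prompt).input_ids) <= max_tokens]
    episodes = [ep for ep in episodes if ep.steps]   # drop emptied episodes
    random.Random(seed).shuffle(episodes)
    return episodes[n_test:], episodes[:n_test]      # train, test
```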

Here we use the set of actions captured in the AndroidControl dataset.13 Following prior work,13 we add a status action to the final step of every episode, which allows the model to signal the completion of an episode and whether its outcome was successful or infeasible. We employ the following set of actions in our dataset:

actions

actions actions

Action Distributions: This table displays the counts of actions in each of our dataset splits. The distribution of actions is similar in our train and test sets, with click being the most prominent function and the actions long_press and navigate_home having extremely few samples (<10) in both.

Additionally, we provide an illustrative example of one specific sample from our dataset in the appendix.

Evaluation Metrics. As with prior works that train action models in Android environments,13 19 20 we compare model performance through stepwise accuracy. This metric compares model-generated actions with ground-truth actions from the dataset, where a correctly generated action contains the same function name and arguments. Similarly to prior work,13 we relax the requirements for generated actions, marking semantically equivalent, but not exact, model generations as successful. Specifically, we permit a 3% margin around a ground-truth coordinate-based action's coordinate values. We also apply a relaxation criterion to generated strings in input_text actions: a model-generated text value is accepted if its lowercased string is within a Levenshtein distance21 of three of the lowercased ground-truth string.
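
A sketch of this relaxed matching, under one plausible reading of the 3% rule (a tolerance of 3% of each screen dimension per coordinate); the function names are ours, not the evaluation harness's.

```python
import Levenshtein  # pip install Levenshtein

def coords_match(pred_xy, gold_xy, screen_wh, tol=0.03):
    """Accept a predicted coordinate within 3% of each screen dimension."""
    (px, py), (gx, gy), (w, h) = pred_xy, gold_xy, screen_wh
    return abs(px - gx) <= tol * w and abs(py - gy) <= tol * h

def text_matches(pred: str, gold: str, max_dist: int = 3) -> bool:
    """Lowercased strings within a Levenshtein distance of three are accepted."""
    return Levenshtein.distance(pred.lower(), gold.lower()) <= max_dist
```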

We conduct our evaluations by running inference with model checkpoints on samples drawn from the test set and reporting the raw, relaxed stepwise accuracy scores. We provide the models with no few-shot examples and generate all completions with greedy decoding.
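
With Hugging Face transformers, this amounts to something like the following, where the variable names and the cap on generated tokens are assumptions:

```python
# `model` and `tokenizer_or_processor` stand in for an evaluated checkpoint and
# its preprocessing pipeline; `inputs` is one prepared test sample.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]   # strip the prompt
action_text = tokenizer_or_processor.decode(new_tokens, skip_special_tokens=True)
```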

Additionally, we analyze the training dynamics of the two approaches. We compare the training and validation losses between the two models' training runs; the relative trends in these values provide insight into how each representation impacts the learning trajectory. Although the models are trained on different modalities, the supervision is identical, making this relative comparison of trends for the best hyperparameter configurations informative about how modality impacts learning.

We do not consider episode-level accuracy in this work, as we filter out steps based on the length of the UI Trees, resulting in several episodes being non-contiguous. This is a key limitation of this work, as episode-level accuracy is ultimately the key metric for action model performance. However, in this preliminary architectural study, we believe that stepwise accuracy, in conjunction with the other measurements we report, provides sufficient information to determine which modality is more favorable as a base for action-taking models.

Results

Our main results are displayed in the following figure and table:

test-acc-curve

test-acc-table

Test Set Accuracy: Top, the stepwise accuracy on the test set for model checkpoints. Bottom, the best results observed in both model runs, along with the base models. *Note that the base models were evaluated with a tailored prompt in the same zero-shot scenario.

We find that, for the dataset considered, the text-model achieves the best absolute stepwise accuracy of 64.76% at 21,000 optimization steps. The best performance of the visual-model is just slightly behind at 63.55% and occurs later in its training, at 28,000 optimization steps. For the two best scores, we conduct a homoscedastic t-test and find that the difference between the best model performances is not statistically significant at any standard value of $\alpha \in \{0.01, 0.05, 0.10\}$.
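
This significance check can be reproduced with a standard equal-variance two-sample t-test over per-step correctness indicators; the arrays below are placeholders for the two models' outcomes on the shared test set, not released evaluation code.

```python
from scipy import stats

# text_correct and vision_correct: placeholder arrays of 0/1 stepwise
# correctness indicators for the two models over the same test steps.
t_stat, p_value = stats.ttest_ind(text_correct, vision_correct, equal_var=True)
print(p_value)   # > 0.10 here, i.e. not significant at alpha in {0.01, 0.05, 0.10}
```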

This finding is quite interesting, as for the dataset and set of actions we consider, the screen representation’s modality does not appear to drastically impact stepwise accuracy performance. We then examined the errors for individual functions for each of the best checkpoint evaluation results:

v5-error-cm|center v6-error-cm|center

Confusion Matrix of Incorrect Predictions: In these two tables, we display the incorrect action predictions. The sum of each “Ground Truth” row corresponds to the total number of incorrect model action predictions for that action. For a given row, each “Predicted” column indicates that the model generated an action with that function as its choice. The diagonal of the confusion matrix (highlighted in yellow) indicates predictions where the model-chosen action was aligned with the ground-truth but the values for the action were incorrect. Non-diagonal elements correspond to an incorrect choice of function.

Here we note that the visual-model tended to select functions more appropriately across the board than the text-model, whereas the text-model was more adept in its predictions of text-based values (lower error rates for input_text and open_app). The click function was by far the most problematic for both models, as most errors occur in click predictions. This function is the most common in our dataset and is also the function most often incorrectly predicted in place of other ground-truth functions.

We inspect the incorrect click predictions for both models and find that their "misclicks", clicks that fail due to incorrect coordinates, differ markedly between the two models. The visual-model has a 1.328x smaller average absolute deviation in its predicted coordinates than the text-model, along with a smaller standard deviation. Among these misclicks, we observe two distinct failure modes in both models: completely incorrect clicks and near-misses. We include several illustrative examples of these failure cases in the Appendix.
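
The deviation statistics quoted above can be computed as follows, with placeholder arrays of predicted and ground-truth click coordinates.

```python
import numpy as np

def misclick_deviation(pred_coords, gold_coords):
    """Mean and standard deviation of the absolute coordinate error.

    pred_coords / gold_coords: placeholder arrays of shape (n_misclicks, 2)
    holding predicted and ground-truth (x, y) click positions.
    """
    abs_dev = np.abs(np.asarray(pred_coords, float) - np.asarray(gold_coords, float))
    return abs_dev.mean(), abs_dev.std()
```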

For the completely incorrect clicks, we identified a distinct pattern wherein one coordinate (either x or y) deviated substantially from the ground truth, whereas the other coordinate was highly accurate, often aligning precisely or closely with its ground-truth counterpart. We examined the locations of these misclicks by overlaying the predicted coordinates on the screenshots; frequently, the incorrect coordinates fell on completely unrelated regions of the screen. We hypothesize that this abrupt deviation in coordinate values is due to an uneven or biased distribution over the supervised x and y coordinate values; we leave further investigation and resolution of this failure mode to future work.

For the near-misses that we observe, a model predicts a coordinate that is relatively close to the ground truth but selects a related yet incorrect element on the screen, or slightly misses the boundary of the ground-truth element. We note that these errors tend to occur in complex UIs with many elements, such as the calculator or photos applications.

While both representations of screens contain similar information, they represent that information in different ways. We find that near-misses for both modalities tend to occur in a manner that reflects the representation. Specifically, the text-model's misclicks frequently select similar or related elements which share text with the ground-truth element in the UI Tree, while the visual-model's misclicks are slightly off spatially, occasionally missing the bounding box by a few pixels. We present a few samples that display these failure modes in the appendix.

Training Dynamics and Efficiency. As part of our study, we analyze the graphs corresponding to the train and validation losses. These charts are taken from the main training jobs of the text and visual models.

Text Model Learning

Vision Model Learning

Learning curves for both runs

Some notable points on these curves are:

- The text-model overfits: its validation loss stops improving well before the end of training even as its training loss continues to fall.
- The visual-model continues to learn throughout the run, consistent with its best test accuracy arriving later, at 28,000 optimization steps.

Critically, we find that multimodal models need fewer tokens to represent the same amount of information. Below we plot the token distributions for the text and visual models' test datasets:

text-modality-tokendistplot|center|600 vision-modality-tokendistplot|center|600

Test Set Token Distributions: These plots display the distribution of tokens required to represent a sample to each model. The visual-model requires far fewer tokens than the text-model. For both models we format the samples with the training-time chat templates. This set of samples is filtered to contain at most 20,000 tokens; the UI Trees in the full dataset are typically much longer.

The number of input tokens required to represent a sample is significantly smaller for the visual modality than for the textual modality, typically 2 to 20x smaller. This reduction comes from using the vision encoder as a compressor of the base image. The test set accuracy demonstrates that the compressed image representation, despite its much shorter sequence length, performs on par with the UI Trees.
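
For reference, per-sample token counts of this kind can be measured with each model's own preprocessing. The checkpoint ids and image-placeholder handling below are assumptions about the setup rather than the exact pipeline used here.

```python
from transformers import AutoProcessor, AutoTokenizer

# Assumed Hub checkpoints for the two base models.
text_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
vlm_proc = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")

def text_sample_tokens(prompt: str) -> int:
    """Token count for a UI-Tree prompt under the text model's tokenizer."""
    return len(text_tok(prompt).input_ids)

def vision_sample_tokens(screenshot, prompt_with_image_tag: str) -> int:
    """Token count for a screenshot prompt; the processor expands the image
    placeholder into the image tokens the language model actually consumes."""
    batch = vlm_proc(text=prompt_with_image_tag, images=[screenshot],
                     return_tensors="pt")
    return batch["input_ids"].shape[-1]
```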

Conclusion

Our study finds that the representation modality of screen states on Android devices does not have a significant effect on the stepwise accuracy of action models. However, the choice does have significant implications for efficiency and training dynamics: the visual modality offers comparable performance with far fewer tokens and continues to learn where the text-only approach overfits. In resource-constrained environments, where action models need to be both effective and efficient, the visual modality presents a compelling architecture for the development of capable agentic systems on mobile devices.

Thank you to Abhay Kashyap, Nicole Fitzerald, Sudhanshu Ranjan, Daniel Bulhosa Solórzano, and Nate Harada for invaluable feedback on preliminary drafts of this post.

Appendix

Example of a Data Sample. This is one data sample from AndroidControl that we used to train the model:

Illustrative Misclicks

In each of the following figures, the ground-truth coordinates for the click are colored green, the text-model's predicted coordinates are colored red, and the visual-model's predictions are colored blue.

Completely Incorrect

n1

n2

n3

n4

Near-Misses

n1

n2

n3

n4

n5

n6

Quite Positive Model Predictions

n1

n2

n3

n4

n5

n6

Supplementary Training Curves

Here we display the learning curves of the two models, relative to each other.

Train Loss

Evaluation Loss

Learning curves relative to each model

References

Footnotes

  1. OpenAI Computer Use- OpenAI. Computer-Using Agent: Introducing a Universal Interface for AI to Interact with the Digital World. 2025, https://openai.com/index/computer-using-agent.

  2. Claude Computer Use- Developing a Computer Use Model. https://www.anthropic.com/news/developing-computer-use. Accessed 13 Feb. 2025.

  3. DigiRL- Bai, Hao, et al. DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. arXiv:2406.11896, arXiv, 14 June 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.11896.

  4. UI-TARS- Qin, Yujia, et al. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv:2501.12326, arXiv, 21 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12326.

  5. Ferret-UI- You, Keen, et al. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. arXiv:2404.05719, arXiv, 8 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2404.05719.

  6. ReACT- Yao, Shunyu, et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629, arXiv, 10 Mar. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2210.03629.

  7. Voyager- Wang, Guanzhi, et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291, arXiv, 19 Oct. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.16291.

  8. Masterman, Tula, et al. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey. arXiv:2404.11584, arXiv, 17 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2404.11584.

  9. Zhang, Jiwen, et al. Android in the Zoo: Chain-of-Action-Thought for GUI Agents. arXiv:2403.02713, arXiv, 13 July 2024. arXiv.org, https://doi.org/10.48550/arXiv.2403.02713.

  10. Wen, Hao, et al. AutoDroid: LLM-Powered Task Automation in Android. arXiv:2308.15272, arXiv, 9 Mar. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2308.15272.

  11. Yan, An, et al. GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation. arXiv:2311.07562, arXiv, 13 Nov. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2311.07562.

  12. UI Trees on Android- https://developer.android.com/develop/ui/compose/accessibility/semantics

  13. AndroidControl- Li, Wei, et al. On the Effects of Data Scale on UI Control Agents. arXiv:2406.03679, arXiv, 13 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2406.03679.

  14. Grattafiori, Aaron, et al. The Llama 3 Herd of Models. arXiv:2407.21783, arXiv, 23 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2407.21783.

  15. Laurençon, Hugo, et al. Building and Better Understanding Vision-Language Models: Insights and Future Directions. arXiv:2408.12637, arXiv, 22 Aug. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2408.12637.

  16. Zhai, Xiaohua, et al. Sigmoid Loss for Language Image Pre-Training. arXiv:2303.15343, arXiv, 27 Sept. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2303.15343.

  17. CrossEntropyLoss — PyTorch 2.6 Documentation. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html. Accessed 13 Feb. 2025.

  18. Hu, Edward J., et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, arXiv, 16 Oct. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2106.09685.

  19. Grattafiori, Aaron, et al. The Llama 3 Herd of Models. arXiv:2407.21783, arXiv, 23 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2407.21783.

  20. Zheng, Boyuan, et al. GPT-4V(Ision) Is a Generalist Web Agent, If Grounded. arXiv:2401.01614, arXiv, 12 Mar. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2401.01614.

  21. Levenshtein Distance - Wikiwand. https://www.wikiwand.com/en/articles/Levenshtein%20distance. Accessed 13 Feb. 2025.

  22. See the Android Control paper for details on how coordinates are used. In this dataset, increasing values of X move from the top left corner of a screenshot towards the right hand side. Increasing Y values increment from the top of the screen down towards the bottom.