Compress2Focus
Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents

Yurun Song1,2,*, Jiong Yin1,3,*, Rongjunchen Zhang1,♠, Ian Harris2,♠
1HiThink Research, 2University of California, Irvine, 3Hangzhou Dianzi University
* Equal Contribution, ♠ corresponding authors
Figure 1. Existing multi-turn methods tend to truncate the visual history because of the limited context length. The proposed CCPO method preserves the key visual history, keeping a longer portion of the trajectory visible.

Abstract

Multi-turn GUI agents enable complex task completion through sequential decision-making, but suffer from severe context inflation as interaction history accumulates. Existing strategies either sacrifice long-term context via truncation or compromise spatial structure through token pruning. In this paper, we propose Coordinate Compression Policy Optimization (CCPO), an efficient framework that couples visual compression with policy optimization for multi-turn GUI agents. CCPO introduces Coordinate-Aware Spatial Compression (CASC), which aggregates coordinates from multiple rollouts to capture target-relevant regions and progressively narrows historical attention around key visual areas. From interactions across rollouts, CASC adaptively constructs attention boundaries that concentrate computation on the most informative regions of the scene. We further design a Distance-Based Advantage that provides fine-grained learning signals based on distance rather than binary correctness, improving both grounding accuracy and compression quality. Extensive experiments demonstrate that CCPO achieves state-of-the-art performance across four benchmarks with up to 55% token compression and a 3.8× training speedup.
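To make the contrast between a binary signal and a distance-based one concrete, here is a minimal sketch. It is not the paper's formula: the exponential decay, the `scale` and `radius` parameters, and the group mean/std normalization are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from statistics import mean, pstdev

def binary_reward(pred, target, radius=0.05):
    """1.0 if the predicted point lies within `radius` of the target, else 0.0."""
    return 1.0 if math.dist(pred, target) <= radius else 0.0

def distance_reward(pred, target, scale=0.1):
    """Assumed dense reward in (0, 1]: decays smoothly with Euclidean distance.
    Coordinates are (x, y) pairs normalized to [0, 1]."""
    return math.exp(-math.dist(pred, target) / scale)

def group_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of rollouts (assumed normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

target = (0.50, 0.50)
# Three rollouts for the same step: a hit, a near miss, and a far miss.
rollouts = [(0.52, 0.51), (0.60, 0.50), (0.90, 0.10)]

binary = [binary_reward(p, target) for p in rollouts]
dense = [distance_reward(p, target) for p in rollouts]

print("binary rewards:   ", binary)
print("dense rewards:    ", [round(r, 3) for r in dense])
print("dense advantages: ", [round(a, 3) for a in group_advantages(dense)])
```

Under the binary reward the near miss and the far miss are indistinguishable, while the distance-based reward ranks them, which is the kind of fine-grained signal the abstract describes.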

Methodology

Figure 2. Overview of the CCPO framework. The training phase (top) optimizes the policy via multi-turn rollouts evaluated by the Distance-Aware Advantage. The Coordinate-Aware Spatial Compression module (bottom) tracks n actions and aggregates their coordinates to predict the region of interest (ROI) at each step, then crops the task-relevant region as a focused visual history h_{t+1}.
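The aggregate-then-crop step in the caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the ROI is the axis-aligned bounding box of the coordinates predicted across rollouts, padded by a fixed `margin` and clamped to the screen. The names `aggregate_roi`, `crop`, and `margin` are hypothetical, and the image is a plain 2D list so the example has no dependencies.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]          # (x, y) in pixels
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive

def aggregate_roi(points: Sequence[Point], width: int, height: int,
                  margin: int = 20) -> Box:
    """Bounding box around all rollout coordinates, padded and clamped to the screen."""
    if not points:
        # No coordinate-based actions: fall back to the full frame.
        return (0, 0, width, height)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    left = max(0, min(xs) - margin)
    top = max(0, min(ys) - margin)
    right = min(width, max(xs) + margin + 1)
    bottom = min(height, max(ys) + margin + 1)
    return (left, top, right, bottom)

def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-image covered by `box`."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def kept_fraction(box: Box, width: int, height: int) -> float:
    """Share of the original pixels that survive the crop."""
    left, top, right, bottom = box
    return (right - left) * (bottom - top) / (width * height)

width, height = 200, 400
# Toy screenshot: each pixel stores its own flat index.
image = [[y * width + x for x in range(width)] for y in range(height)]

# Coordinates predicted by four rollouts for the same step.
rollout_points = [(90, 180), (100, 210), (120, 195), (85, 250)]

box = aggregate_roi(rollout_points, width, height)
history_crop = crop(image, box)
print("ROI box:", box)
print("crop size:", len(history_crop[0]), "x", len(history_crop))
print("kept fraction:", round(kept_fraction(box, width, height), 3))
```

A bounding box is the simplest possible aggregation; it is sensitive to a single outlier rollout, so a real system might prefer a more robust rule.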

Experimental Results

Training Efficiency

Model    | History Length | Token Length ↓ | Compression Ratio ↑ | Training Time (s/step)
SO-RL-3B | 1AO            | 6998           | 0.0%                | 515
SO-RL-3B | 3AO            | 9888           | 0.0%                | 660
CCPO-3B  | 1AO            | 4271           | 38.9%               | 154 (3.3×)
CCPO-3B  | 3AO            | 4460           | 54.9%               | 174 (3.8×)
SO-RL-7B | 1AO            | 7026           | 0.0%                | 569
SO-RL-7B | 3AO            | 9550           | 0.0%                | 717
CCPO-7B  | 1AO            | 4262           | 39.3%               | 186 (3.1×)
CCPO-7B  | 3AO            | 4473           | 53.2%               | 204 (3.5×)

Table 1. Training efficiency comparison between CCPO and Semi-Online RL (SO-RL) on the Android Control dataset.
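The table does not state how its last two columns are derived. The reported figures are consistent with a compression ratio of 1 minus the ratio of CCPO tokens to SO-RL tokens, and a speedup of SO-RL time divided by CCPO time, both at matching model size and history length. The short check below works through the 3AO rows under that assumption; the helper names are hypothetical.

```python
def compression_ratio(baseline_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed relative to the baseline."""
    return 1.0 - compressed_tokens / baseline_tokens

def speedup(baseline_time: float, new_time: float) -> float:
    """How many times faster the new method is per training step."""
    return baseline_time / new_time

# (token length, training time in s/step) taken from the 3AO rows of Table 1.
rows = {
    "3B": {"so_rl": (9888, 660), "ccpo": (4460, 174)},
    "7B": {"so_rl": (9550, 717), "ccpo": (4473, 204)},
}

results = {}
for size, r in rows.items():
    (base_tok, base_time), (ccpo_tok, ccpo_time) = r["so_rl"], r["ccpo"]
    results[size] = (
        round(100 * compression_ratio(base_tok, ccpo_tok), 1),
        round(speedup(base_time, ccpo_time), 1),
    )
    print(f"{size} 3AO: compression {results[size][0]}%, speedup {results[size][1]}x")
```

The 3B result reproduces the headline numbers quoted in the abstract (about 55% token compression and a 3.8× speedup).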

Method | Compute Load (TFLOPS) ↓ | Token Latency (ms) ↓ | Step Latency (s) ↓
SO-RL  | 9.6                     | 0.064                | 297.1
CCPO   | 5.4 (-44%)              | 0.057 (-10%)         | 194.5 (-35%)

Table 2. Training efficiency comparison in terms of compute load and latency.

Results on Android Control and GUI Odyssey datasets

Model                  | History | AC-High TM | AC-High GR | AC-High SR | Odyssey TM | Odyssey GR | Odyssey SR
Open-source Models
OS-Atlas-4B ZS         | A       | 49.0       | 49.5       | 22.8       | 49.6       | 34.6       | 20.3
OS-Atlas-4B FT         | A       | 84.7       | 73.8       | 67.5       | 83.5       | 61.4       | 56.4
Qwen2.5VL-3B           | A       | 47.8       | 46.5       | 38.9       | 37.4       | 26.5       | 26.7
UI-R1-3B               | --      | 57.9       | 55.7       | 45.4       | 52.2       | 34.5       | 32.5
GUI-R1-3B              | A       | 58.0       | 56.2       | 46.6       | 54.8       | 41.5       | 41.3
OS-Genesis-7B          | AO      | 65.9       | --         | 44.4       | 11.7       | --         | 3.6
Aguvis-7B              | A       | 65.6       | --         | 54.2       | 26.7       | --         | 13.5
GUI-R1-7B              | A       | 71.6       | 65.6       | 51.7       | 65.5       | 43.6       | 38.8
AgentCPM-GUI-8B        | A       | 77.7       | --         | 69.2       | 90.8       | --         | 75.0
OS-Atlas-7B ZS         | A       | 57.4       | 54.9       | 29.8       | 60.4       | 39.7       | 27.0
OS-Atlas-7B FT         | A       | 85.2       | 78.5       | 71.2       | 84.5       | 67.8       | 62.0
UI-TARS-7B             | AOT     | 83.7       | 80.5       | 72.5       | 94.6       | 90.1       | 87.0
UI-S1-7B               | AOT     | 79.9       | 73.4       | 68.2       | 76.3       | 61.7       | 59.5
Our Models
Qwen2.5VL-3B (0-shot)  | AO      | 24.9       | 68.3       | 20.2       | 27.8       | 46.4       | 14.7
  w/ SFT               | AO      | 85.2       | 73.5       | 68.6       | 88.0       | 84.3       | 75.9
  w/ Semi-online RL    | AO      | 83.7       | 74.8       | 67.5       | 82.6       | 81.3       | 71.3
CCPO-3B-1AO            | AO      | 85.3       | 76.7       | 70.6       | 91.7       | 87.2       | 81.1
CCPO-3B-3AO            | AO      | 85.7       | 77.5       | 70.8       | 90.6       | 88.5       | 80.9
Qwen2.5VL-7B (0-shot)  | AO      | 58.9       | 70.3       | 44.1       | 55.8       | 50.8       | 31.8
  w/ SFT               | AO      | 85.9       | 75.9       | 70.6       | 88.0       | 84.6       | 76.0
  w/ Semi-online RL    | AO      | 86.3       | 76.7       | 70.6       | 89.2       | 84.9       | 76.7
CCPO-7B-1AO            | AO      | 86.4       | 78.8       | 72.2       | 91.1       | 87.2       | 80.3
CCPO-7B-3AO            | AO      | 86.9       | 79.7       | 73.3       | 91.8       | 89.3       | 82.4

Table 3. Results of our CCPO models on the Android Control (High) and GUI Odyssey navigation tasks. In the history format, the letters A, O, and T indicate that the model's history includes Actions, Observations, and Thoughts, respectively.

Results on Mind2Web and AITW

Method               | Param | Mind2Web Cross-Task | Mind2Web Cross-Website | Mind2Web Cross-Domain | AITW Overall | AITW ClickAvg
Qwen-VL 9.6B         | 9.6B  | 13.3                | 9.2                    | 12.0                  | 54.3         | 57.4
SeeClick             | 9.6B  | 25.5                | 16.4                   | 20.8                  | 59.3         | 66.4
R-VLM                | 9.6B  | 28.7                | 26.1                   | 24.3                  | 64.9         | 71.0
Iris                 | 9.6B  | 32.0                | 26.2                   | 28.8                  | 63.6         | 71.0
Qwen2-VL             | 2B    | 46.7                | 42.2                   | 44.6                  | 57.7         | --
ShowUI-2B            | 2B    | 37.2                | 35.1                   | 35.2                  | 70.0         | --
SimpAgent            | 2B    | 48.7                | 42.2                   | 45.0                  | 71.5         | --
TongUI-3B            | 2B    | 48.8                | 48.1                   | 49.5                  | 71.6         | --
TongUI-7B            | 7B    | 53.4                | 49.0                   | 52.9                  | 73.3         | --
Qwen2.5-VL-3B w/ SFT | 3B    | 52.0                | 46.5                   | 48.7                  | 70.8         | 78.4
CCPO-3B-1AO          | 3B    | 54.6                | 50.6                   | 50.6                  | 71.8         | 79.7
CCPO-3B-3AO          | 3B    | 56.5                | 51.0                   | 51.8                  | 73.1         | 80.4
Qwen2.5-VL-7B w/ SFT | 7B    | 55.6                | 51.3                   | 52.0                  | 72.3         | 80.2
CCPO-7B-1AO          | 7B    | 58.0                | 53.4                   | 55.7                  | 73.5         | 81.0
CCPO-7B-3AO          | 7B    | 59.5                | 53.6                   | 56.5                  | 74.4         | 81.4

Table 4. Results of CCPO on the Mind2Web and AITW benchmarks across different settings.

Results on AITW Benchmark

Method               | General | Single | Web Shopping | Install | Google Apps | Overall | ClickAvg
Qwen-VL 9.6B         | 49.5    | 64.7   | 50.7         | 59.9    | 46.9        | 54.3    | 57.4
SeeClick             | 54.0    | 73.7   | 57.6         | 66.4    | 54.9        | 59.3    | 66.4
R-VLM                | 59.9    | 72.5   | 61.7         | 70.6    | 59.6        | 64.9    | 71.0
Qwen2-VL             | 48.3    | 57.8   | 51.6         | 77.4    | 52.9        | 57.7    | --
Iris                 | 61.5    | 71.4   | 58.3         | 66.4    | 60.2        | 63.6    | 71.0
ShowUI-2B            | 63.9    | 77.5   | 66.6         | 72.5    | 69.7        | 70.0    | --
SimpAgent            | 64.1    | 76.2   | 67.2         | 75.8    | 74.0        | 71.5    | --
TongUI-3B            | 65.6    | 77.0   | 65.8         | 75.1    | 74.5        | 71.6    | --
TongUI-7B            | 67.6    | 79.9   | 69.1         | 76.3    | 73.5        | 73.3    | --
Qwen2.5-VL-3B w/ SFT | 61.5    | 75.4   | 67.2         | 75.8    | 74.1        | 70.8    | 78.4
CCPO-3B-1AO w/o CR   | 62.7    | 78.2   | 65.1         | 75.5    | 76.4        | 71.6    | 79.1
CCPO-3B-1AO          | 64.3    | 76.1   | 67.2         | 76.1    | 75.4        | 71.8    | 79.7
CCPO-3B-3AO w/o CR   | 65.2    | 79.2   | 66.6         | 76.5    | 75.8        | 72.7    | 80.0
CCPO-3B-3AO          | 65.3    | 77.5   | 68.3         | 78.3    | 76.0        | 73.1    | 80.4
Qwen2.5-VL-7B w/ SFT | 64.8    | 77.5   | 68.5         | 76.9    | 73.9        | 72.3    | 80.2
CCPO-7B-1AO w/o CR   | 66.4    | 79.4   | 67.5         | 75.9    | 76.2        | 73.1    | 79.3
CCPO-7B-1AO          | 67.0    | 78.2   | 68.7         | 77.3    | 76.2        | 73.5    | 81.0
CCPO-7B-3AO w/o CR   | 64.9    | 79.4   | 70.0         | 77.3    | 79.0        | 74.1    | 80.5
CCPO-7B-3AO          | 68.3    | 78.7   | 69.6         | 77.3    | 78.0        | 74.4    | 81.4

Table 5. Results of CCPO-MAX on the AITW benchmark.

Qualitative Comparison

[Figures: Cases 1 to 8 and Failure Cases 1 and 2. Each case shows the SFT Baseline alongside CCPO (Ours).]

Analysis

Coordinate-Based Actions Distribution for Three Datasets

[Figure panels: GUI Odyssey, AITW, Android Control.]

Performance Comparison for AC Datasets

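Model         | History (AO) | TM    | GR    | SR
Qwen2.5-VL-7B | 1AO          | 83.75 | 74.95 | 67.97
Qwen2.5-VL-7B | 2AO          | 85.30 | 75.95 | 70.00
Qwen2.5-VL-7B | 3AO          | 85.94 | 75.95 | 70.60
Qwen2.5-VL-7B | 4AO          | 84.89 | 75.77 | 69.65
CCPO-7B       | 1AO          | 86.45 | 78.80 | 72.18
CCPO-7B       | 2AO          | 86.86 | 79.48 | 73.19
CCPO-7B       | 3AO          | 86.89 | 79.71 | 73.25
CCPO-7B       | 4AO          | 86.27 | 80.20 | 73.11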

Table 6. Performance on the Android Control (AC) dataset with history lengths from 1AO to 4AO.

Ablation Study

Method           | AC-TM         | AC-GR         | AC-SR
Qwen2.5VL-7B SFT | 85.94         | 75.95         | 70.60
+ Semi-online    | 86.27 (+0.33) | 77.93 (+1.98) | 72.35 (+1.75)
+ CASC           | 86.72 (+0.78) | 79.12 (+3.17) | 72.70 (+2.10)
+ CASC + CR      | 86.89 (+0.95) | 79.71 (+3.76) | 73.25 (+2.65)

Table 7. Ablation study of different components on the Android Control dataset.
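The parenthesized gains in the ablation are consistent with differences from the SFT row rather than from the row directly above. The snippet below recomputes them under that reading; the dictionary layout and the `gains` helper are illustrative only.

```python
# Android Control scores taken from the ablation table.
baseline = {"TM": 85.94, "GR": 75.95, "SR": 70.60}  # Qwen2.5VL-7B SFT
variants = {
    "+ Semi-online": {"TM": 86.27, "GR": 77.93, "SR": 72.35},
    "+ CASC":        {"TM": 86.72, "GR": 79.12, "SR": 72.70},
    "+ CASC + CR":   {"TM": 86.89, "GR": 79.71, "SR": 73.25},
}

def gains(row: dict, base: dict) -> dict:
    """Absolute improvement of each metric over the SFT baseline."""
    return {metric: round(row[metric] - base[metric], 2) for metric in base}

deltas = {name: gains(row, baseline) for name, row in variants.items()}
for name, delta in deltas.items():
    print(f"{name:14s}", {m: f"+{v:.2f}" for m, v in delta.items()})
```

Read this way, the full configuration (CASC plus CR) improves grounding (GR) the most, by 3.76 points over SFT.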

BibTeX

@article{Anonymous2026compress2focus,
  author    = {Anonymous},
  title     = {Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents},
  journal   = {xxxx},
  year      = {2026},
}