...where a navigation algorithm is proposed... what metrics and baselines were used? ...how were the conclusions justified?
Reflections on benchmarking, some of which popped up while reviewing papers...
Literature review (poster)
Challenges
Opportunities (with a bit of self-promotion)
(I know... that's the closest I could get)
Facts
“A short methodological review on social robot navigation benchmarking”
Pranup Chhetri, Alejandro Torrejón, Sergio Eslava, and Luis J. Manso
1. What metrics and algorithms are used?
2. How frequent are surveys w/ human raters?
3. How are benchmarking results interpreted?
PRISMA diagram
“social robot navigation” OR “social navigation”
Sankey diagram
Quantitative benchmarking?
$\mathbf{17.6\%}$ did not provide any quantitative performance benchmarking
Some provided metrics only for their own algorithm
Metric selection (33 metrics /82)
2.85 metrics per paper on average (3.25)
2.01 non-social (2.48), 0.84 social (1.04)
31% don't use social metrics (21%)
Can we cover everything we want to optimise for with 3-4 metrics? If not, how can a paper claim better results?
Baseline selection: 1.46 baselines on average (0.47 social, post-2000)
ORCA and DWA are not social navigation (SocNav) algorithms, so comparing against them alone is unfair. SFM was great... in 1995! DWA is from 2002, ORCA is from 2011, SARL from 2019.
Do you see where I'm going?
Challenges
Selection of metrics
Selection of baselines
Drawing conclusions from experiments
Limitations of current metrics
Selection of metrics
($\sim 3$ metrics on average, $\sim 1$ social [$D_{min}$, $SC$])
How many metrics do we need to cover all aspects we are interested in?
Is increasing the number of metrics always positive?
Multiple implementations of the "same" metric do not behave as advertised. Descriptions are often vague.
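As a toy illustration (not from the reviewed papers) of how vague descriptions lead to diverging implementations, here are two plausible ways to compute the minimum distance to humans, $D_{min}$: one measures centre-to-centre distances, the other subtracts assumed robot and human radii. Both fit a loose textual definition, yet they report different numbers for the same trajectory.

```python
import numpy as np

def d_min_centres(robot_xy, humans_xy):
    """D_min, variant A: minimum centre-to-centre distance over the trajectory.

    robot_xy:  (T, 2) robot positions
    humans_xy: (T, H, 2) human positions at each timestep
    """
    dists = np.linalg.norm(humans_xy - robot_xy[:, None, :], axis=-1)
    return dists.min()

def d_min_boundaries(robot_xy, humans_xy, robot_radius=0.3, human_radius=0.3):
    """D_min, variant B: same data, but measured between body boundaries.

    The radii are illustrative values, not standardised ones.
    """
    dists = np.linalg.norm(humans_xy - robot_xy[:, None, :], axis=-1)
    return (dists - robot_radius - human_radius).min()

# Same trajectory, two "correct" implementations, two different numbers.
rng = np.random.default_rng(0)
robot = rng.uniform(0, 5, size=(50, 2))
humans = rng.uniform(0, 5, size=(50, 3, 2))
print(d_min_centres(robot, humans), d_min_boundaries(robot, humans))
```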
Selection of baselines: few and unfair
($<2$ baselines on average, $<1$ social)
Real-life constraints (e.g., time)
Baselines are not plug-and-play
No standardised interface
Structured vs. end-to-end approaches clash
Initiatives still need work: Arena 4.0, SEAN 2.0 (observations, interfacing, docs)
Could one baseline be enough?
Only when there is a known, clear winner
Drawing conclusions
“Our algorithm did better than the rest of the baselines in N out of M metrics”
Problem: Not all metrics are the same
Problem: More metrics → more redundancy → harder to decide
Difficult to balance against the fact that current metrics don't account for everything! (Toy example below.)
The best-performing algorithm is not the only one worth publishing (e.g., work targeting specific scenarios or features)
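A minimal, made-up example of why "better in N out of M metrics" is fragile when metrics are redundant: if three of four metrics essentially measure the same thing, an algorithm can "win" 3/4 while losing on the only independent axis. All scores below are invented for illustration.

```python
# Invented scores (higher is better) for two hypothetical algorithms.
# path_eff, time_eff and speed_score are strongly redundant (all "efficiency");
# social_comfort is the only independent axis.
scores = {
    "A": {"path_eff": 0.90, "time_eff": 0.88, "speed_score": 0.91, "social_comfort": 0.40},
    "B": {"path_eff": 0.85, "time_eff": 0.84, "speed_score": 0.86, "social_comfort": 0.80},
}

wins = {algo: 0 for algo in scores}
for metric in scores["A"]:
    best = max(scores, key=lambda a: scores[a][metric])
    wins[best] += 1

print(wins)  # {'A': 3, 'B': 1}: A "wins" 3/4 metrics by triple-counting efficiency
```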
Limitations of current metrics
Lack of experimental evidence, both in how metrics are developed and in how they are used!
The "variable spiral"(three examples)
1. Interactions (human-human, human-object)
2. Density (high vs. low; see the sketch after this list)
3. Task context
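To make the density example concrete: the same minimum-distance reading can be acceptable in a crowded corridor and poor in an empty hall. The threshold function below is purely illustrative (the values are invented), just to show that interpreting $D_{min}$ without a density variable loses information.

```python
def acceptable_d_min(crowd_density):
    """Illustrative (invented) acceptability threshold in metres.

    In sparse spaces we expect the robot to keep more distance than
    is physically possible in dense crowds.
    """
    # people per square metre -> minimum acceptable clearance
    return 1.2 if crowd_density < 0.3 else 0.45

reading = 0.6  # metres: the same measured D_min in both scenes
for density in (0.1, 1.5):
    verdict = "ok" if reading >= acceptable_d_min(density) else "too close"
    print(f"density={density} p/m^2 -> D_min={reading} m is {verdict}")
```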
Solution?
Adding more metrics would complicate interpretation
But we cannot just do nothing about it!
Two opportunities
Opportunity: Benchmarking platform
Leaderboard
Support for both structured and raw observations (see the interface sketch below)
Loaded with baseline implementations
Loaded with community-verified metric implementations
Easy to use (podman/docker)
Realistic & varied human behaviour
Diverse scenarios
Let's not reinvent the wheel
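A rough sketch (all names hypothetical, not an existing platform's API) of the kind of common baseline interface and leaderboard loop such a platform could expose: observations carry both structured fields (tracked people, goal) and raw sensor data, so structured and end-to-end baselines can be plugged in without per-paper glue code, and community-verified metrics are applied uniformly.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class Observation:
    """Hypothetical observation container: structured fields plus raw sensors."""
    robot_pose: tuple[float, float, float]
    goal: tuple[float, float]
    people: list[tuple[float, float]]                   # structured: tracked positions
    raw: dict[str, Any] = field(default_factory=dict)   # e.g. lidar/RGB for end-to-end policies

class Baseline(Protocol):
    """Hypothetical plug-and-play contract every baseline would implement."""
    def reset(self, scenario_id: str) -> None: ...
    def act(self, obs: Observation) -> tuple[float, float]: ...  # (v, w) command

def evaluate(baselines: dict[str, Baseline], scenarios, metrics) -> dict[str, dict[str, float]]:
    """Run every baseline on every scenario and average each metric into a leaderboard."""
    board: dict[str, dict[str, float]] = {}
    for name, policy in baselines.items():
        per_metric: dict[str, list[float]] = {m.__name__: [] for m in metrics}
        for scenario in scenarios:
            policy.reset(scenario.scenario_id)
            trajectory = scenario.rollout(policy)   # simulator-side rollout (assumed)
            for m in metrics:
                per_metric[m.__name__].append(m(trajectory))
        board[name] = {k: sum(v) / len(v) for k, v in per_metric.items()}
    return board
```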
Opportunity: Agreed set of weighted metrics?
Could we have one single metric to control them all?
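If the community ever agreed on weights, collapsing a metric vector into a single score would be mechanically trivial; the hard part is agreeing on the weights and on how each metric is normalised. The weights below are placeholders, not a proposal.

```python
# Placeholder weights: the actual values would need community agreement,
# likely informed by a taxonomy such as Francis et al. (2025).
WEIGHTS = {
    "success": 0.35,
    "path_efficiency": 0.15,
    "min_dist_to_humans": 0.25,
    "space_compliance": 0.25,
}

def aggregate(normalised_metrics: dict[str, float]) -> float:
    """Weighted sum of metrics already normalised to [0, 1], higher is better."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * normalised_metrics[k] for k in WEIGHTS)

print(aggregate({"success": 1.0, "path_efficiency": 0.8,
                 "min_dist_to_humans": 0.6, "space_compliance": 0.7}))
```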
A taxonomy of metrics
Francis, Anthony, et al. "Principles and guidelines for evaluating social robot navigation algorithms." ACM Transactions on Human-Robot Interaction 14.2 (2025): 1-65.