Challenges & Opportunities in Benchmarking Social Robot Navigation
   
AC-RSN 2025: Advances and Challenges in Robot Social Navigation
   
   
   
    Dr Luis J. Manso
    Senior Lecturer (Associate Professor) in Computer Science - Aston University
    Autonomous Robotics and Perception Laboratory - https://arp-lab.com
    https://ljmanso.com

Have you reviewed any SocNav paper lately?


... where a navigation algorithm is proposed...
... what metrics and baselines were used?
... how were the conclusions justified?
Reflections on benchmarking, some of which popped up while reviewing papers...
   
  1. Literature review (poster)
  2. Challenges
  3. Opportunities (with a bit of self-promotion)
   

(I know... that's the closest I could get)

Facts

“A short methodological review on social robot navigation benchmarking”

Pranup Chhetri, Alejandro Torrejón, Sergio Eslava, and Luis J. Manso

1. What metrics and algorithms are used?
2. How frequent are surveys w/ human raters?
3. How are benchmarking results interpreted?

PRISMA diagram

“social robot navigation” OR “social navigation”

Sankey diagram

Quantitative benchmarking?

  • $\mathbf{17.6\%}$ did not provide any quantitative performance benchmarking
  • Some reported metrics exclusively for their own algorithm

Metric selection (33 metrics / 82)

  • 2.85 metrics on average (3.25)
  • 2.01 non-social (2.48)
  • 0.84 social (1.04)
  • 31% don't use social metrics (21%)
Can we cover what we want to optimise for with 3-4 metrics?
How can a paper claim better results?

Baseline selection: 1.46 baselines on average (0.47 social, post-2000)

ORCA and DWA are not SocNav algorithms, so comparing only against them is unfair. SFM was great... in 1995! DWA is from 1997, ORCA from 2011, SARL from 2019.

Do you see where I'm going?

Challenges

  • Selection of metrics
  • Selection of baselines
  • Drawing conclusions from experiments
  • Limitations of current metrics

Selection of metrics

($\sim 3$ metrics on average, $\sim 1$ social [$D_{min}$, $SC$])
  • How many metrics do we need to cover all the aspects we are interested in?
    • Is increasing the number of metrics always positive?
  • The same metric "as advertised" often has multiple, differing implementations, and descriptions are vague (see the sketch below).
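
As an illustration of how easily implementations diverge, here is a minimal sketch of two commonly reported metrics under one possible set of assumptions: $D_{min}$ as the minimum human-robot distance over an episode, and $SC$ (space compliance) as the fraction of timesteps in which the robot stays outside every pedestrian's personal space. The 0.5 m threshold and these exact definitions are assumptions, not a standard; published implementations differ, which is precisely the point.

    import numpy as np

    def d_min(robot_xy: np.ndarray, humans_xy: np.ndarray) -> float:
        """Minimum human-robot distance over an episode.
        robot_xy: (T, 2) robot positions; humans_xy: (T, H, 2) human positions."""
        dists = np.linalg.norm(humans_xy - robot_xy[:, None, :], axis=-1)
        return float(dists.min())

    def space_compliance(robot_xy: np.ndarray, humans_xy: np.ndarray,
                         personal_space: float = 0.5) -> float:
        """Fraction of timesteps with the robot outside every human's
        personal space (threshold assumed here, not standardised)."""
        dists = np.linalg.norm(humans_xy - robot_xy[:, None, :], axis=-1)
        return float((dists > personal_space).all(axis=1).mean())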

Selection of baselines: few and unfair

($<2$ baselines on average, $<1$ social)
  • Real-life constraints (e.g., time)
  • Baselines are not plug-and-play
    • No standardised interface
    • Structured vs. end-to-end inputs clash
  • Initiatives still need work: Arena 4.0, SEAN 2.0 (observations, interfacing, docs)
  • Could one baseline be enough?
    • Only when there is a known clear winner

Drawing conclusions

  • “Our algorithm did better than the rest of the baselines in N out of M metrics”
  • Problem: not all metrics are equally important
  • Problem: more metrics → more redundancy → harder to decide
  • Hard to balance against the fact that current metrics don't account for everything!
  • The best-performing algorithm is not the only one worth publishing (e.g., strengths in specific scenarios/features)

Limitations of current metrics

  • Lack of experimental evidence, both in their development and in their usage!
  • The "variable spiral" (three examples):
    1. Interactions (human-human, human-object)
    2. Density (high vs. low)
    3. Task context

Solution?

  • Adding more metrics would complicate interpretation
  • But we cannot just do nothing about it!

Two opportunities

Opportunity: Benchmarking platform

  • Leaderboard
  • Support for both structured and raw observations (see the interface sketch below)
  • Loaded with baseline implementations
  • Loaded with community-verified metric implementations
  • Easy to use (podman/docker)
  • Realistic & varied human behaviour
  • Diverse scenarios
  • Let's not reinvent the wheel
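
To illustrate what "plug-and-play" could mean in practice, below is a hypothetical minimal planner interface such a platform could require from every baseline. The names (`Observation`, `SocNavPlanner`, `reset`, `act`) are illustrative assumptions, not an existing standard; both structured and raw observation fields are optional so structured and end-to-end methods can coexist behind the same API.

    from dataclasses import dataclass
    from typing import Optional, Protocol
    import numpy as np

    @dataclass
    class Observation:
        # Structured view of the scene (None for end-to-end baselines).
        robot_pose: Optional[np.ndarray] = None    # (x, y, theta)
        human_states: Optional[np.ndarray] = None  # (H, 4): x, y, vx, vy
        goal: Optional[np.ndarray] = None          # (x, y)
        # Raw view (None for structured baselines).
        lidar: Optional[np.ndarray] = None         # (N,) range scan
        rgb: Optional[np.ndarray] = None           # (H, W, 3) image

    class SocNavPlanner(Protocol):
        def reset(self, scenario_id: str) -> None: ...
        def act(self, obs: Observation) -> np.ndarray: ...  # returns (v, w)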

Opportunity: Agreed set of weighted metrics?

Could we have one single metric to control them all?
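
A minimal sketch of what such a single score could look like, assuming every metric has already been normalised to [0, 1] with 1 meaning best, and assuming the community agrees on the weights. The metric names and weights below are placeholders; agreeing on them is the hard part, not the code.

    def socnav_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
        """Weighted aggregate of normalised metrics (1 = best, 0 = worst)."""
        total = sum(weights.values())
        return sum(weights[name] * metrics[name] for name in weights) / total

    # Placeholder metrics and weights, not an agreed standard.
    score = socnav_score(
        metrics={"success": 1.0, "path_efficiency": 0.8, "space_compliance": 0.9},
        weights={"success": 0.4, "path_efficiency": 0.2, "space_compliance": 0.4},
    )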

A taxonomy of metrics

ALT: remember "the bitter lesson"

Francis, Anthony, et al. "Principles and guidelines for evaluating social robot navigation algorithms." ACM Transactions on Human-Robot Interaction 14.2 (2025): 1-65.

The aims

  • 🎯 Overarching: increased satisfaction of users & companies
  • Acceptance (inc. comfort, legibility...)
  • Efficiency (inc. success rate, time, path length)
  • Evidence-based solution (move away from intuition)

The plan (submitted working prototype - https://arxiv.org/pdf/2509.01251)


  • Need to scale the dataset
  • Involve companies to make real-life impact

Raw data acquisition

Survey and survey results

  • Combined dataset with $4,427$ trajectories
    ($182$ real and $4,245$ simulated)
  • $49$ participants scored $6,481$ trajectories.
  • After data quality assurance, $4,402$ rated trajectories remained.

Quadratic weighted Cohen's kappa coefficient on control questions
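
For reference, the agreement score can be computed with scikit-learn's quadratic weighted Cohen's kappa; the ratings below are placeholder values, not the study's data.

    from sklearn.metrics import cohen_kappa_score

    # Placeholder 1-5 ratings given twice to the same control trajectories.
    first_pass  = [3, 4, 2, 5, 1, 4]
    second_pass = [3, 5, 2, 4, 1, 4]

    kappa = cohen_kappa_score(first_pass, second_pass, weights="quadratic")
    print(f"Quadratic weighted Cohen's kappa: {kappa:.3f}")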

Quantitative results

  • Validation loss of $0.0469$ (MSE)
  • Test loss: $\mathbf{0.0477}$ (MSE), $\mathbf{0.162}$ (MAE)
  • Loss w.r.t. the mean of the control questions: $0.0062$ (MSE) and $0.066$ (MAE)

Qualitative results

1 pedestrian, static: 0.2 m/s, 0.4 m/s, 0.8 m/s, 1.6 m/s

3 pedestrians, static: 0.2 m/s, 0.4 m/s, 0.8 m/s, 1.6 m/s

What went wrong / could be improved:

  • Time efficiency
  • Geometry-aware features
  • Need for more contributors!
    • Need for more trajectory data (sim & real)
    • Exploit the existing dataset to get scores

Team work!

  • → Pilar Bachiller (Universidad de Extremadura) ←
  • Ulysses Bernardet (Aston University)
  • Luis V. Calderita (Universidad de Extremadura)
  • Pranup Chhetri (Aston University)
  • Anthony Francis (Logical Robotics)
  • Noriaki Hirose (UC Berkeley / TOYOTA)
  • Noe Perez (Universidad Pablo de Olavide)
  • Dhruv Shah (Google DeepMind)
  • Phani T. Singamaneni (LAAS-CNRS)
  • Xuesu Xiao (George Mason University)
  • Luis J. Manso (Aston University)

Challenges & Opportunities in Benchmarking Social Robot Navigation

   

Take home points:

  • We need to take benchmarking more seriously.
  • Hard to extract conclusions otherwise.
  • Opportunities:
    • ALT prototype metric
    • Need for a benchmarking software platform
   

Resources: