...where a navigation algorithm is proposed... what metrics and baselines were used? ...how were the conclusions justified?
Reflections on benchmarking, some of which popped up while reviewing papers...
Literature review (poster)
Challenges
Opportunities (with a bit of self-promotion)
(I know... that's the closest I could get)
Facts
“A short methodological review on social robot navigation benchmarking”
Pranup Chhetri, Alejandro Torrejón, Sergio Eslava, and Luis J. Manso
1. What metrics and algorithms are used?
2. How frequent are surveys w/ human raters?
3. How are benchmarking results interpreted?
PRISMA diagram
“social robot navigation” OR “social navigation”
Sankey diagram
Quantitative benchmarking?
$\mathbf{17.6\%}$ did not provide any quantitative performance benchmarking
Some provided metrics only for their own algorithm
Metric selection (33 metrics /82)
2.85 metrics per paper on average (3.25)
2.01 non-social (2.48), 0.84 social (1.04)
31% don't use social metrics (21%)
Can we cover everything we want to optimise for with 3-4 metrics? If not, how can a paper claim better results?
Baseline selection: 1.46 baselines on average (0.47 social, post-2000)
ORCA and DWA are not social navigation (SocNav) algorithms, so comparing against them alone is unfair. SFM was great... in 1995! DWA is from 2002, ORCA is from 2011, SARL from 2019.
Do you see where I'm going?
Challenges
Selection of metrics
Selection of baselines
Drawing conclusions from experiments
Limitations of current metrics
Selection of metrics
($\sim 3$ metrics on average, $\sim 1$ social [$D_{min}$, $SC$])
How many metrics do we need to cover all aspects we are interested in?
Is increasing the number of metrics always positive?
Multiple implementations of the "same" metric do not behave as advertised. Descriptions are often vague.
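As a toy illustration (not from the reviewed papers) of how vague descriptions lead to diverging implementations, here are two plausible ways to compute the minimum distance to humans, $D_{min}$: one measures centre-to-centre distances, the other subtracts assumed robot and human radii. Both fit a loose textual definition, yet they report different numbers for the same trajectory.

```python
import numpy as np

def d_min_centres(robot_xy, humans_xy):
    """D_min, variant A: minimum centre-to-centre distance over the trajectory.

    robot_xy:  (T, 2) robot positions
    humans_xy: (T, H, 2) human positions at each timestep
    """
    dists = np.linalg.norm(humans_xy - robot_xy[:, None, :], axis=-1)
    return dists.min()

def d_min_boundaries(robot_xy, humans_xy, robot_radius=0.3, human_radius=0.3):
    """D_min, variant B: same data, but measured between body boundaries.

    The radii are illustrative values, not standardised ones.
    """
    dists = np.linalg.norm(humans_xy - robot_xy[:, None, :], axis=-1)
    return (dists - robot_radius - human_radius).min()

# Same trajectory, two "correct" implementations, two different numbers.
rng = np.random.default_rng(0)
robot = rng.uniform(0, 5, size=(50, 2))
humans = rng.uniform(0, 5, size=(50, 3, 2))
print(d_min_centres(robot, humans), d_min_boundaries(robot, humans))
```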
Selection of baselines: few and unfair
($<2$ baselines on average, $<1$ social)
Real-life constraints (e.g., time)
Baselines are not plug-and-play
No standardised interface
Structured vs. end-to-end approaches clash
Initiatives still need work: Arena 4.0, SEAN 2.0 (observations, interfacing, docs)
Could one baseline be enough?
Only when there is a known, clear winner
Drawing conclusions
“Our algorithm did better than the rest of the baselines in N out of M metrics”
Problem: Not all metrics are the same
Problem: More metrics → more redundancy → harder to decide
Difficult to balance against the fact that current metrics don't account for everything! (Toy example below.)
The best-performing algorithm is not the only one worth publishing (e.g., work targeting specific scenarios or features)
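A minimal, made-up example of why "better in N out of M metrics" is fragile when metrics are redundant: if three of four metrics essentially measure the same thing, an algorithm can "win" 3/4 while losing on the only independent axis. All scores below are invented for illustration.

```python
# Invented scores (higher is better) for two hypothetical algorithms.
# path_eff, time_eff and speed_score are strongly redundant (all "efficiency");
# social_comfort is the only independent axis.
scores = {
    "A": {"path_eff": 0.90, "time_eff": 0.88, "speed_score": 0.91, "social_comfort": 0.40},
    "B": {"path_eff": 0.85, "time_eff": 0.84, "speed_score": 0.86, "social_comfort": 0.80},
}

wins = {algo: 0 for algo in scores}
for metric in scores["A"]:
    best = max(scores, key=lambda a: scores[a][metric])
    wins[best] += 1

print(wins)  # {'A': 3, 'B': 1}: A "wins" 3/4 metrics by triple-counting efficiency
```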
Limitations of current metrics
Lack of experimental evidence, both in how metrics are developed and in how they are used!
The "variable spiral"(three examples)
1. Interactions (human-human, human-object)
2. Density (high vs. low; see the sketch after this list)
3. Task context
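To make the density example concrete: the same minimum-distance reading can be acceptable in a crowded corridor and poor in an empty hall. The threshold function below is purely illustrative (the values are invented), just to show that interpreting $D_{min}$ without a density variable loses information.

```python
def acceptable_d_min(crowd_density):
    """Illustrative (invented) acceptability threshold in metres.

    In sparse spaces we expect the robot to keep more distance than
    is physically possible in dense crowds.
    """
    # people per square metre -> minimum acceptable clearance
    return 1.2 if crowd_density < 0.3 else 0.45

reading = 0.6  # metres: the same measured D_min in both scenes
for density in (0.1, 1.5):
    verdict = "ok" if reading >= acceptable_d_min(density) else "too close"
    print(f"density={density} p/m^2 -> D_min={reading} m is {verdict}")
```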
Solution?
Adding more metrics would complicate interpretation
But we cannot just do nothing about it!
Two opportunities
Opportunity: Benchmarking platform
Leaderboard
Support for both structured and raw observations (see the interface sketch below)
Loaded with baseline implementations
Loaded with community-verified metric implementations
Easy to use (podman/docker)
Realistic & varied human behaviour
Diverse scenarios
Let's not reinvent the wheel
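A rough sketch (all names hypothetical, not an existing platform's API) of the kind of common baseline interface and leaderboard loop such a platform could expose: observations carry both structured fields (tracked people, goal) and raw sensor data, so structured and end-to-end baselines can be plugged in without per-paper glue code, and community-verified metrics are applied uniformly.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class Observation:
    """Hypothetical observation container: structured fields plus raw sensors."""
    robot_pose: tuple[float, float, float]
    goal: tuple[float, float]
    people: list[tuple[float, float]]                   # structured: tracked positions
    raw: dict[str, Any] = field(default_factory=dict)   # e.g. lidar/RGB for end-to-end policies

class Baseline(Protocol):
    """Hypothetical plug-and-play contract every baseline would implement."""
    def reset(self, scenario_id: str) -> None: ...
    def act(self, obs: Observation) -> tuple[float, float]: ...  # (v, w) command

def evaluate(baselines: dict[str, Baseline], scenarios, metrics) -> dict[str, dict[str, float]]:
    """Run every baseline on every scenario and average each metric into a leaderboard."""
    board: dict[str, dict[str, float]] = {}
    for name, policy in baselines.items():
        per_metric: dict[str, list[float]] = {m.__name__: [] for m in metrics}
        for scenario in scenarios:
            policy.reset(scenario.scenario_id)
            trajectory = scenario.rollout(policy)   # simulator-side rollout (assumed)
            for m in metrics:
                per_metric[m.__name__].append(m(trajectory))
        board[name] = {k: sum(v) / len(v) for k, v in per_metric.items()}
    return board
```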
Opportunity: Agreed set of weighted metrics?
Could we have one single metric to control them all?
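If the community ever agreed on weights, collapsing a metric vector into a single score would be mechanically trivial; the hard part is agreeing on the weights and on how each metric is normalised. The weights below are placeholders, not a proposal.

```python
# Placeholder weights: the actual values would need community agreement,
# likely informed by a taxonomy such as Francis et al. (2025).
WEIGHTS = {
    "success": 0.35,
    "path_efficiency": 0.15,
    "min_dist_to_humans": 0.25,
    "space_compliance": 0.25,
}

def aggregate(normalised_metrics: dict[str, float]) -> float:
    """Weighted sum of metrics already normalised to [0, 1], higher is better."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * normalised_metrics[k] for k in WEIGHTS)

print(aggregate({"success": 1.0, "path_efficiency": 0.8,
                 "min_dist_to_humans": 0.6, "space_compliance": 0.7}))
```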
A taxonomy of metrics
Francis, Anthony, et al. "Principles and guidelines for evaluating social robot navigation algorithms." ACM Transactions on Human-Robot Interaction 14.2 (2025): 1-65.