The “AI and the Everything in the Whole Wide World Benchmark” paper argues that the practice of benchmarking is a poor way of measuring the field of AI’s progress towards general intelligence. Because of the sheer size of these benchmarks, their extensive category lists, and the community consensus built around them, progress on them is assumed to be progress towards general capabilities. Yet these benchmarks carry inherent subjective biases. Many of them are built around a particular dataset rather than around a thorough description of the task they are meant to measure, and the criteria used to select their tasks do not reflect careful attention to developing general-purpose capabilities. Financial influence from big corporations is another factor that shapes the general direction of research topics, pushing the field towards empirical and incremental work rather than towards more relevant real-world problems.

To make this case, the authors trace the historical definitions of benchmarks and how they are presented in recent influential papers, both by the creators of the benchmarks and by the scientific community that uses them. Benchmarks like GLUE and ImageNet are among the examples used to illustrate the problem. For each case study, the authors lay out the limitations they see, and after discussing those challenges they go on to suggest alternative roles for benchmarks and better evaluation methods to properly account for progress towards general intelligence.

The ideas presented in this paper are an original contribution to the research community. The paper makes a compelling argument supported by historical evidence and current trends, and using the most influential benchmarks as examples makes the reader aware of the seriousness of the problem. The main question I have is what an ideal benchmark would actually look like. The case studies are NLP and computer vision benchmarks, so I would like to see what a general-purpose benchmark would mean in those narrower contexts: can any existing benchmark be considered a good example of a general-purpose benchmark for computer vision? The paper could also simplify its explanation of some points, though I found its use of stories, references, and examples to clarify its arguments engaging.

Finally, I would like to see practical attempts to follow the paper’s recommendations, so that researchers have a real-world example to learn from. I hope to see this paper’s suggestions accepted and adopted in the field.

The Paper: arxiv.org/abs/2111.15366