IntPhys: A Benchmark for Visual Intuitive Physics Reasoning
To reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. This raises the challenge of how to evaluate intuitive physics in such systems, especially when they are not constructed directly with intuitive physics as an objective function (unsupervised or weakly supervised learning). Drawing inspiration from research in developmental psychology, this benchmark aims to provide a diagnostic test for increasingly difficult aspects of intuitive physics.
The IntPhys benchmark can be applied to any vision system, engineered or trained, provided it can output a scalar when presented with a video clip; this scalar should correspond to how physically plausible the clip is. Our test set contains well-matched videos of possible versus impossible events, and the metric measures how well the vision system can tell the possible events apart from the impossible ones.
Our benchmark is therefore:
- task neutral: it can be applied across very different systems that have been trained on a variety of tasks such as Visual Question Answering, 3D reconstruction, or motor planning.
- model neutral: It only requires models to output a physical plausibility score over an entire video.
- bias free: because the test is synthetic (constructed with a game engine), it enables careful control, which makes it free of the usual biases present in more realistic datasets.
- diagnostic: note that it is NOT intended to provide a loss function for training a system's parameters. Its purpose is to diagnose a system on increasingly complex sub-problems of intuitive physics (object individuation, kinematics, object interactions, etc.). The dev set is therefore small, intended only for tuning the plausibility scalar.
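To make the idea of telling possible from impossible events concrete, here is a minimal sketch of one way to score a system from its plausibility scalars. It assumes each possible video is matched with exactly one impossible video and counts how often the possible one receives the higher score. The function name `pairwise_accuracy` and the sample scores are hypothetical and are not part of the benchmark's evaluation code; the metric actually used is defined in the benchmark description.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of matched (possible, impossible) pairs ranked correctly.

    Each pair holds the plausibility score a system gave to a possible
    video and to its matched impossible video. A pair is correct when the
    possible video scores strictly higher. Ties count as half, which is
    what random guessing would earn on average.
    """
    if not pairs:
        raise ValueError("at least one (possible, impossible) pair is required")

    correct = 0.0
    for possible_score, impossible_score in pairs:
        if possible_score > impossible_score:
            correct += 1.0
        elif possible_score == impossible_score:
            correct += 0.5
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical scores: (possible video, matched impossible video).
    example_scores = [(0.9, 0.2), (0.6, 0.7), (0.5, 0.5), (0.8, 0.1)]
    print(pairwise_accuracy(example_scores))
```

Under this assumed scoring, a system that assigns plausibility at random would land near 0.5, and a system that always prefers the possible video would reach 1.0.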
- Benchmark Description
- Test Blocks
- Training Set: learning by observation
- IntPhys Challenge
- Download and resources
- Future benchmarks