SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning

Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning♠∗, Parisa Kordjamshidi
Michigan State University, ♠Amazon

{mirzaeem,rajabyfa,kordjams} [email protected]

This paper proposes a question-answering (QA) benchmark for spatial reasoning on natural language text which contains more realistic spatial phenomena not covered by prior work and is challenging for state-of-the-art language models (LM). We propose a distant supervision method to improve on this task. Specifically, we design grammar and reasoning rules to automatically generate a spatial description of visual scenes and corresponding QA pairs. Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs’ capability on spatial understanding, which in turn helps to better solve two external datasets, bAbI, and boolQ. We hope that this work can foster investigations into more sophisticated models for spatial reasoning over text.
1 Introduction
Spatial reasoning is a cognitive process based on the construction of mental representations for spatial objects, relations, and transformations (Clements and Battista, 1992), which is necessary for many natural language understanding (NLU) tasks such as natural language navigation (Chen et al., 2019; Roman Roman et al., 2020; Kim et al., 2020), human-machine interaction (Landsiedel et al., 2017; Roman Roman et al., 2020), dialogue systems (Udagawa et al., 2020), and clinical analysis (Datta and Roberts, 2020).
Modern language models (LMs), e.g., BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and XLNet (Yang et al., 2019), have seen great success in natural language processing (NLP). However, there has been limited investigation into the spatial reasoning capabilities of LMs. To the best of our knowledge, bAbI (Weston et al., 2015) (Fig. 9) is the only dataset with direct textual spatial question answering (QA) (Task 17), but it is synthetic and overly simplified: (1) The underlying scenes are spatially simple, with only three objects and relations only in four directions. (2) The stories for these scenes are two short, templated sentences, each describing a single relation between two objects. (3) The questions typically require at most two steps of reasoning due to the simplicity of those stories.
To address these issues, this paper proposes a new dataset, SPARTQA¹ (see Fig. 1). Specifically, (1) SPARTQA is built on NLVR’s (Suhr et al., 2017) images containing more objects with richer spatial structures (Fig. 1b). (2) SPARTQA’s stories are more natural, have more sentences, and are richer in spatial relations in each sentence. (3) SPARTQA’s questions require deeper reasoning and have four types: find relation (FR), find blocks (FB), choose object (CO), and yes/no (YN), which allows for more fine-grained analysis of models’ capabilities.
We showed annotators random images from NLVR and instructed them to describe objects and relationships naturally rather than exhaustively (Sec. 3). In total, we obtained 1.1k unique QA pair annotations on spatial reasoning, evenly distributed among the aforementioned types. Similar to bAbI, we keep this dataset at a relatively small scale and suggest using as little training data as possible. Experiments show that modern LMs (e.g., BERT) do not perform well in this low-resource setting.
This paper thus proposes a way to obtain distant supervision signals for spatial reasoning (Sec. 4). As spatial relationships are rarely mentioned in existing corpora, we take advantage of the fact that spatial language is grounded to the geometry of visual scenes. We are able to automatically generate stories for NLVR images (Suhr et al., 2017) via our newly designed context-free grammars (CFG) and context-sensitive rules. In the process of story generation, we store the information about all objects and relationships, such that QA pairs can also be generated automatically. In contrast to bAbI, we use various spatial rules to infer new relationships in these QA pairs, which requires more complex reasoning capabilities. Hereafter, we call this automatically-generated dataset SPARTQA-AUTO, and the human-annotated one SPARTQA-HUMAN.
Experiments show that, by further pretraining on SPARTQA-AUTO, we improve LMs’ performance on SPARTQA-HUMAN by a large margin.² The spatially-improved LMs also show stronger performance on two external QA datasets, bAbI and boolQ (Clark et al., 2019): BERT further pretrained on SPARTQA-AUTO only requires half of the training data to achieve 99% accuracy on bAbI as compared to the original BERT; on boolQ’s development set, this model shows better performance than BERT, with 2.3% relative error reduction.³
Our contributions can be summarized as follows. First, we propose the first human-curated benchmark, SPARTQA-HUMAN, for spatial reasoning with richer spatial phenomena than the prior synthetic dataset bAbI (Task 17). Second, we exploit the scene structure of images and design novel CFGs and spatial reasoning rules to automatically generate data (i.e., SPARTQA-AUTO) to obtain distant supervision signals for spatial reasoning over text. Third, SPARTQA-AUTO proves to be a rich source of spatial knowledge that improved the performance of LMs on SPARTQA-HUMAN as well as on different data domains such as bAbI and boolQ.

STORY: We have three blocks, A, B and C. Block B is to the right of block C and it is below block A. Block A has two black medium squares. Medium black square number one is below medium black square number two and a medium blue square. It is touching the bottom edge of this block. The medium blue square is below medium black square number two. Block B contains one medium black square. Block C contains one medium blue square and one medium black square. The medium blue square is below the medium black square.
QUESTIONS:
FB: Which block(s) has a medium thing that is below a black square? A, B, C
FB: Which block(s) doesn't have any blue square that is to the left of a medium square? A, B
FR: What is the relation between the medium black square which is in block C and the medium square that is below a medium black square that is touching the bottom edge of a block? Left
CO: Which object is above a medium black square? the medium black square which is in block C or medium black square number two? medium black square number two
YN: Is there a square that is below medium square number two above all medium black squares that are touching the bottom edge of a block? Yes
(a) An example story and corresponding questions and answers.
[Image panels: NLVR image; Described image. Annotation between the panels: choose some objects and relations randomly and add relationship between blocks.]
(b) An example NLVR image and the scene created in Fig. 1a, where the blocks in the NLVR image are rearranged.
Figure 1: Example from SPARTQA (specifically from SPARTQA-AUTO)

∗ Work was done while at the Allen Institute for AI.
¹ SPAtial Reasoning on Textual Question Answering.
² Further pretraining LMs has become a common practice and baseline method for transferring knowledge between tasks (Phang et al., 2018; Zhou et al., 2020). We leave more advanced methods for future work.
³ To the best of our knowledge, the test set or leaderboard of boolQ has not been released yet.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4582–4598
June 6–11, 2021. ©2021 Association for Computational Linguistics

2 Related work
Question answering is a useful format to evaluate machines’ capability of reading comprehension (Gardner et al., 2019) and many recent works have been implementing this strategy to test machines’ understanding of linguistic formalisms: He et al. (2015); Michael et al. (2018); Levy et al. (2017); Jia et al. (2018); Ning et al. (2020); Du and Cardie (2020). An important advantage of QA is using natural language to annotate natural language, thus having the flexibility to get annotations on complex phenomena such as spatial reasoning. However, spatial reasoning phenomena have been covered minimally in the existing works.
To the best of our knowledge, Task 17 of the bAbI project (Weston et al., 2015) is the only QA dataset focused on textual spatial reasoning (examples in Appendix F). However, bAbI is synthetic and does not reflect the complexity of the spatial reasoning in natural language. Solving Task 17 of bAbI typically does not require sophisticated reasoning, which is an important capability emphasized by more recent works (e.g., Dua et al. (2019); Khashabi et al. (2018); Yang et al. (2018); Dasigi et al. (2019); Ning et al. (2020)).

Figure 2: For “A blue circle is above a big triangle. To the left of the big triangle, there is a square,” if the question is: “Is the square to the left of the blue circle?”, the answer is neither Yes nor No. Thus, the correct answer is “Do not Know” (DK) in our setting.
Spatial reasoning is arguably more prominent in multi-modal QA benchmarks, e.g., NLVR (Suhr et al., 2017), VQA (Antol et al., 2015), GQA (Hudson and Manning, 2019), CLEVR (Johnson et al., 2017). However, those spatial reasoning phenomena are mostly expressed naturally through images, while this paper focuses on studying spatial reasoning on natural language. Some other works on visual-spatial reasoning are based on geographical information inside maps and diagrams (Huang et al., 2019) and navigational instructions (Chen et al., 2019; Anderson et al., 2018).
As another approach to evaluate spatial reasoning capabilities of models, a dataset proposed in Ghanimifard and Dobnik (2017) generates a synthetic training set of spatial sentences and evaluates the models’ ability to generate spatial facts and sentences containing composition and decomposition of relations on grounded objects.

3 SPARTQA-HUMAN
To mitigate the aforementioned problems of Task 17 of bAbI, i.e., simple scenes, stories, and questions, this section describes the data annotation process of SPARTQA-HUMAN and explains how those problems were addressed.
First, we randomly selected a subset of NLVR images, each of which has three blocks containing multiple objects (see Fig. 1b). The scenes shown by these images are more complicated than those described by bAbI because (1) there are more objects in NLVR images; (2) the spatial relationships in NLVR are not limited to just four relative directions as objects are placed arbitrarily within blocks.
Second, two student volunteers produced textual descriptions of those objects and their corresponding spatial relationships based on these images. Since the blocks are always horizontally aligned in each NLVR image, to allow for more flexibility, annotators could also rearrange these blocks (see Fig. 1a). Relationships between objects within the same block can take the forms of relative direction (e.g., left or above), qualitative distance (e.g., near or far), and topological relationship (e.g., touching or containing).
However, we instructed the annotators not to describe all objects and relationships, (1) to avoid unnecessarily verbose stories, and (2) to intentionally omit some information to enable more complex reasoning later. Therefore, annotators describe only a random subset of blocks, objects, and relationships.
To query more interesting phenomena, annotators were then encouraged to write questions requiring detecting relations and reasoning over them using multiple spatial rules. A spatial rule can be one of the transitivity (A → B, B → C ⇒ A → C), symmetry (A → B ⇒ B → A), converse ((A, R, B) ⇒ (B, reverse(R), A)), inclusion (obj1 in A), and exclusion (obj1 not in B) rules.
There are four types of questions (Q-TYPE). (1) FR: find the relation between two objects. (2) FB: find the block that contains certain object(s). (3) CO: choose between two objects mentioned in the question that meet certain criteria. (4) YN: a yes/no question that tests if a claim on a spatial relationship holds.
FB, FR, and CO questions are formulated as multiple-choice questions⁴ and receive a list of candidate answers, and the answer to a YN question is chosen from Yes, No, or “DK” (Do not Know). The “DK” option is due to the open-world assumption of the stories, where if something is not described in the text, it is not considered as false (see Fig. 2).
Finally, annotators were able to create 1.1k QA pairs on spatial reasoning on the generated descriptions, distributed among the aforementioned types. We intentionally keep this data at a relatively small scale for two reasons. First, there has been some consensus in our community that modern systems, given their sufficiently large model capacities, can easily find shortcuts and overfit a dataset if provided with a large amount of training data (Gardner et al., 2020; Sen and Saffari, 2020). Second, collecting spatial reasoning QAs is very costly: the two annotators spent 45-60 minutes on average to create a single story with 8-16 QA pairs. We estimate that SPARTQA-HUMAN cost about 100 human hours in total. The expert performance on 100 examples of SPARTQA-HUMAN’s test set, measured by the accuracy of answering the questions, is 92% across the four Q-TYPEs on average, indicating its high quality.

⁴ CO can be considered as both a single-choice and a multiple-choice question.

Q-TYPE   SPARTQA-HUMAN     SPARTQA-AUTO
         Test    Train     Seen Test   Unseen Test   Dev      Train
FB       104     154       3872        3872          3842     23654
FR       105     149       3712        3721          3742     23302
YN       194     162       3896        3896          3860     23968
CO       107     151       3594        3598          3579     22794
Total    510     616       15074       15087         15023    93673

Table 1: Number of questions per Q-TYPE

4 Distant Supervision: SPARTQA-AUTO
Since human annotations are costly, it is important to investigate ways to generate distant supervision signals for spatial reasoning. However, unlike conventional distant supervision approaches (e.g., Mintz et al. (2009); Zeng et al. (2015); Zhou et al. (2020)) where distant supervision data can be selected from large corpora by implementing specialized filtering rules, spatial reasoning does not appear often in existing corpora. Therefore, similar to SPARTQA-HUMAN, we take advantage of the ground truth of NLVR images, design CFGs to generate stories, and use spatial reasoning rules to ask and answer spatial reasoning questions. This automatically generated data is called SPARTQA-AUTO, and below we describe its generation process in detail.
Story generation Since NLVR comes with structured descriptions of the ground truth locations of those objects, we were able to choose random blocks and objects from each image programmatically. The benefit is two-fold. First, a random selection of blocks and objects allows us to create multiple stories for each image; second, this randomness also creates spatial reasoning opportunities with missing information.
Once we decide on a set of blocks and objects to be included, we determine their relationships: the relationships between blocks are generated randomly, while those between objects are determined from the ground truth of these images.
Now we have a scene containing a set of blocks and objects and their associated relationships. To produce a story for this scene, we design CFGs to produce natural language sentences that describe those blocks/objects/relationships in various expressions (see Fig. 3 for two portions of our CFG describing relative and nested relations between objects).
Being grounded to visual scenes guarantees spatial coherency in a story, and using CFGs helps to produce grammatically correct sentences and various expressions. We also design context-sensitive rules to limit the options for each CFG variable based on the chosen entities (e.g., black circle), or on what is described in the previous sentences (e.g., Block A has a circle. The circle is below a triangle.).

Example sentence: The big black shape is above the medium triangle.
Nonterminals shown: Article, Relation, Object, Size, Color, Shape, Ind_shape
Article: the | a
Relation: above | left | …
Size: small | medium | big
Color: yellow | blue | black
Shape: square | triangle | circle
Ind_shape: shape | object | thing
(a) Part of the grammar describing relations between objects
Example sentence: The big black shape is above the object that is to the right of the medium triangle
(b) Part of the grammar describing nested relationships.
Figure 3: Two parts of our designed CFG

Question generation To generate questions based on a passage, there are rule-based systems (Heilman and Smith, 2009; Labutov et al., 2015), neural networks (Du et al., 2017), and their combinations (Dhole and Manning, 2020). However, in our approach, while generating each story, the program stores the information about the entities and their relationships. Thus, without processing the raw text, which is error-prone, we generate questions by only looking at the stored data. The question generation operates based on four primary functionalities, Choose-objects, Describe-objects, Find-all-relations, and Find-similar-objects. These modules are responsible for controlling the logical consistency, correctness, and the number of steps required for reasoning in each question.
Choose-objects randomly chooses up to three objects from the set of possible objects in a story under a set of constraints such as preventing selection of similar objects, or excluding objects with relations that are directly mentioned in the text.
Describe-objects generates a mention phrase for an object using parts of its full name (presented in the story). The generated phrase points either to a unique object or to a group of objects, such as "the big circle" or "big circles." To describe a unique object, it chooses an attribute or a group of attributes that apply to a unique object among others in the story. To increase the steps of reasoning, the description may include the relationship of the object to other objects instead of using a direct unique description. For example, "the circle which is above the black triangle."
Find-all-relations completes the relationship graph between objects by applying a set of spatial rules such as transitivity, symmetry, converse, inclusion, and exclusion on top of the direct relations described in the story. As shown in Fig. 4, it does an exhaustive search over all combinations of the relations that link two objects to each other.
Find-similar-objects finds all the mentions matching a description from the question to objects in the story. For instance, for the question "is there any blue circle above the big blue triangle?", this module finds all the mentions in the story matching the description “a blue circle”.

[Figure 4 content: query ?(obj1, obj4); given relations Left(obj1, obj2), Touching(obj2, obj3), Right(obj4, obj2). Step 1: obj1 is left of obj2, and obj2 is linked to obj4 via ~right = left. Step 2: left ^ left => left. Step 3: Left(obj1, obj4).]
Figure 4: Find the implicit relation between obj1 and obj4 by the transitivity rule. (1) Find a set of objects that have a relation with obj1. Continue the same process on the new set until obj4 is found. (2) Get the union of the intermediate relations between these two objects and it is the final answer.

Similar to SPARTQA-HUMAN, we provide four Q-TYPEs: FR, FB, CO, and YN. To generate FR questions, we choose two objects using the Choose-objects module and question their relationships. The YN Q-TYPE is similar to FR, but the question specifies one relationship of interest, chosen from all relations extracted by the Find-all-relations module, to be questioned about the objects. Since Yes/No questions are simpler problems most of the time, we make this question type more complex by adding quantifiers (“all” and “any”). These quantifiers help to evaluate the models’ capability to aggregate relations between more than two objects in the story and to reason over all the relations found to reach the final answer. In the FB Q-TYPE, we mention an object by its indirect relation to another object using the nested relation in the Describe-objects module and ask to find the blocks containing or not containing this object. Finally, the CO question selects an anchor object (Choose-objects) and specifies a relationship (using Find-all-relations) in the question. Two other objects are chosen as candidates to check whether the specified relationship holds between them and the anchor object. We bias the algorithm toward choosing candidates that have at least one relationship to the anchor object. For more details about the different question templates, see Table 7 in the Appendix.
Answer generation We compute all direct and indirect relationships between objects using the Find-all-relations function and, based on the Q-TYPE, generate the final answer.
For instance, in the YN Q-TYPE, if the asked relation exists in the found relations, the answer is "Yes"; if the inverse relation exists, it must be "No"; otherwise, it is "DK".⁵

⁵ The SPARTQA-AUTO generation code and the file of dataset are available at SpartQA_generation

4.1 Corpus Statistics
We generate the train, dev, and test set splits based on the same splits of the images in the NLVR dataset. On average, each story contains 9 sentences (Min: 3, Max: 22) and 118 tokens (Min: 66,


Max: 274). Also, the average number of tokens in each question (over all Q-TYPEs) is 23 (Min: 6, Max: 57). Table 1 shows the total number of each question type in SPARTQA-AUTO (see Table 8 in the Appendix for more statistics about the labels).

5 Models for Spatial Reasoning over Language
This section describes the model architectures on different Q-TYPEs: FR, YN, FB, and CO. All Q-TYPEs can be cast into a sequence classification task, and the three transformer-based LMs tested in this paper, BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and XLNet (Yang et al., 2019), can all handle this type of task by classifying the representation of [CLS], a special token prepended to each target sequence (see Appendix E). Depending on the Q-TYPE, the input sequence and how we do inference may be different.
FR and YN both have a predefined label set as candidate answers, and their input sequences are both the concatenation of a story and a question. While the answer to a YN question is a single label chosen from Yes, No, and DK, FR questions can have multiple correct answers. Therefore, we treat each candidate answer to FR as an independent binary classification problem, and take the union as the final answer. As for YN, we choose the label with the highest confidence (Fig. 8b).
As the candidate answers to FB and CO are not fixed and depend on each story and its question, the input sequences to these Q-TYPEs are concatenated with each candidate answer. Since the defined YN and FR model has moderately less accurate results on FB and CO Q-TYPEs, we add an LSTM (Hochreiter and Schmidhuber, 1997) layer to improve it. Hence, to find the final answer, we run the model with each candidate answer and then apply an LSTM layer on top of all token representations. Then, we use the last vector of the LSTM outputs for classification (Fig. 8a). The final answers are selected based on Eq. (1).

$$
\begin{aligned}
x_i &= [s, c_i, q] \\
T_i &= [t^i_1, \dots, t^i_{m_i}] = \mathrm{LM}(x_i) \\
[h^i_1, \dots, h^i_{m_i}] &= \mathrm{LSTM}(T_i) \\
y_i &= [y^0_i, y^1_i] = \mathrm{Softmax}\big((h^i_{m_i})^{\top} W\big) \\
\mathrm{Answer} &= \{\, c_i \mid \arg\max_j\, y^j_i = 1 \,\}
\end{aligned}
\tag{1}
$$

where s is the story, c_i is the candidate answer, q is the question, [ ] indicates the concatenation of the listed vectors, and m_i is the number of tokens in x_i. The parameter vector, W, is shared for all candidates.

5.1 Training and Inference
We train the models based on the summation of the cross-entropy losses of all binary classifiers in the architecture. For FR and YN Q-TYPEs, there are multiple classifiers, while there is only one classifier used for CO and FB Q-TYPEs.
We remove inconsistent answers in post-processing for FR and YN Q-TYPEs during the inference phase. For instance, on FR, left and right relations between two objects cannot be valid at the same time. For YN, as there is only one valid answer amongst the three candidates, we select the candidate with the maximal predicted probability of being the true answer.

6 Experiments
As fine-tuning LMs has become a common baseline approach to knowledge transfer from a source dataset to a target task, including but not limited to Phang et al. (2018); Zhou et al. (2020); He et al. (2020b), we study the capability of spatial reasoning of modern LMs, specifically BERT, ALBERT, and XLNet, after fine-tuning them on SPARTQA-AUTO. This fine-tuning process is also known as further pretraining, to distinguish it from the fine-tuning process on one’s target task. It is an open problem to find better transfer learning techniques than simple further pretraining, as suggested in He et al. (2020a); Khashabi et al. (2020), which is beyond the scope of this work. All experiments use the models proposed in Sec. 5. We use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 2 × 10⁻⁶ and Focal Loss (Lin et al., 2017) with γ = 2 for training all the models.⁶

⁶ All code is available at HLR/SpartQA-baselines

6.1 Further pretraining on SPARTQA-AUTO improves spatial reasoning
Table 2 shows performance on SPARTQA-HUMAN in a low-resource setting, where 0.6k QA pairs from SPARTQA-HUMAN are used for fine-tuning these LMs and 0.5k for testing (see Table 1 for information on this split).⁷ During our annotation, we found that the description of “near to” and “far from” varies largely between annotators. Therefore, we ignore these two relations from FR Q-TYPE in our evaluations.

⁷ Note this low-resource setting can also be viewed as a spatial reasoning probe to these LMs (Tenney et al., 2019).

#  Model                       FB      FR      CO      YN      Avg
1  Majority                    28.84   24.52   40.18   53.60   36.64
2  BERT                        16.34   20      26.16   45.36   30.17
3  BERT (Stories only; MLM)    21.15   16.19   27.1    51.54   32.90
4  BERT (SPARTQA-AUTO; MLM)    19.23   29.54   32.71   47.42   34.88
5  BERT (SPARTQA-AUTO)         62.5    46.66   32.71   47.42   47.25
6  Human                       91.66   95.23   91.66   90.69   92.31

Table 2: Further pretraining BERT on SPARTQA-AUTO improves accuracies on SPARTQA-HUMAN. All systems are fine-tuned on the training data of SPARTQA-HUMAN, but Systems 3-5 are also further pretrained in different ways. System 3: further pretrained on the stories from SPARTQA-AUTO as a masked language model (MLM) task. System 4: further pretrained on both stories and QA annotations as MLM. System 5: the proposed model that is further pretrained on SPARTQA-AUTO as a QA task. Avg: The micro-average on all four Q-TYPEs.
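To make the Avg column of Table 2 concrete, the sketch below recomputes a micro-average for System 5, BERT (SPARTQA-AUTO), by weighting each Q-TYPE accuracy with its number of test questions. The helper names are hypothetical, and the weights are an assumption: they are the SPARTQA-HUMAN test counts as read from Table 1, used unchanged even though the "near to" and "far from" relations are ignored for FR.

```python
# SPARTQA-HUMAN test-set sizes per Q-TYPE (assumed weights, read from Table 1).
test_counts = {"FB": 104, "FR": 105, "CO": 107, "YN": 194}

# Accuracies (%) of System 5, BERT (SPARTQA-AUTO), from Table 2.
system5_acc = {"FB": 62.5, "FR": 46.66, "CO": 32.71, "YN": 47.42}

def micro_average(acc, counts):
    """Weight every Q-TYPE accuracy by its number of questions."""
    total = sum(counts[q] for q in acc)
    return sum(acc[q] * counts[q] for q in acc) / total

def macro_average(acc):
    """Unweighted mean over Q-TYPEs, shown only for contrast."""
    return sum(acc.values()) / len(acc)

print(round(micro_average(system5_acc, test_counts), 2))  # 47.25
print(round(macro_average(system5_acc), 2))               # 47.32
```

Under these assumptions the micro-average matches the reported 47.25, whereas the unweighted mean does not; the larger YN test set pulls the average toward the YN accuracy.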
In Table 2, System 5, BERT (SPARTQA-AUTO), is the proposed method of further pretraining BERT on SPARTQA-AUTO. We can see that System 2, the original BERT, performs consistently lower than System 5, indicating that having SPARTQA-AUTO as a further pretraining task improves BERT’s spatial understanding.
In addition, we implement another two baselines. System 3, BERT (Stories only; MLM): further pretraining BERT only on the stories of SPARTQA-AUTO as a masked language model (MLM) task; System 4, BERT (SPARTQA-AUTO; MLM): we convert the QA pairs in SPARTQA-AUTO into textual statements and further pretrain BERT on the text as an MLM (see Fig. 5 for an example conversion).
To convert each question and its answer into a sentence, we utilize static templates for each question type, which remove the question words and rearrange the other parts into a sentence.

A big circle is above a triangle. A blue square is below the triangle.
What is the relation between the circle and the blue object? Answer: Above

A big circle is above a triangle. A blue square is below the triangle. The circle is [MASK] the blue object. Answer: Above

Figure 5: Convert a triplet of (paragraph, question, answer) into a single piece of text for the MLM task.

We can see that System 3 slightly improves over System 2, an observation consistent with many prior works that seeing more text generally helps an LM (e.g., Gururangan et al. (2020)). The significant gap between System 3 and the proposed System 5 indicates that supervision signals come more from our annotations in SPARTQA-AUTO rather than from seeing more unannotated text. System 4 is another way to make use of the annotations in SPARTQA-AUTO, but it is shown to be not as effective as further pretraining BERT on SPARTQA-AUTO as a QA task.
While the proposed System 5 overall performs better than the other three baseline systems, one exception is its accuracy on YN, which is lower than that of System 3. Since all systems’ YN accuracies are also lower than the majority baseline⁸, we hypothesize that this is due to imbalanced data. To verify it, we compute the F1 score for YN Q-TYPE in Table 3, where we see all systems effectively achieve better scores than the majority baseline. However, further pretraining BERT on SPARTQA-AUTO still does not beat other baseline systems, which implies that straightforward pretraining is not necessarily helpful in capturing the complex reasoning phenomena required by YN questions.

Table 3: Switching from accuracy in Table 2 to F1 shows that the models are all performing better than the majority baseline on YN Q-TYPE.

The human performance is evaluated on 100 random questions from each SPARTQA-AUTO and SPARTQA-HUMAN test set. The respondents are graduate students who were trained on some examples of the dataset before answering the final questions. We can see from Table 2 that all systems’ performances fall behind human performance by a large margin. We expand on the difficulty of SPARTQA in the next subsection.

⁸ The majority baseline predicts the label that is most common in each set of SPARTQA.

6.2 SPARTQA is challenging
In addition to BERT, we continue to test another two LMs, ALBERT and XLNet (Table 4). We further pretrain these LMs on SPARTQA-AUTO, and test them on SPARTQA-HUMAN (the numbers of BERT are copied from Table 2) and two held-out test sets of SPARTQA-AUTO, Seen and Unseen. Note that when a system is tested against SPARTQA-HUMAN, it is fine-tuned on SPARTQA-HUMAN’s training data following its further pretraining on SPARTQA-AUTO. We use the Unseen set to test to what extent the baseline models use shortcuts in the language surface. This set applies minor modifications randomly on a number of stories and questions to change the names of shapes, colors, sizes, and relationships in the vocabulary of the stories, which do not influence the reasoning steps (more details in Appendix C.1).

#  Model      FB                       FR                       CO                       YN
              Seen    Unseen  Human*   Seen    Unseen  Human*   Seen    Unseen  Human*   Seen    Unseen  Human*
1  Majority   48.70   48.70   28.84    40.81   40.81   24.52    20.59   20.38   40.18    49.94   49.91   53.60
2  BERT       87.13   69.38   62.5     85.68   73.71   46.66    71.44   61.09   32.71    78.29   76.81   47.42
3  ALBERT     97.66   83.53   56.73    91.61   83.70   44.76    95.20   84.55   49.53    79.38   75.05   41.75
4  XLNet      98.00   84.85   73.07    94.60   91.63   57.14    97.11   90.88   50.46    79.91   78.54   39.69
5  Human

Table 4: Spatial reasoning is challenging. We further pretrain three transformer-based LMs, BERT, ALBERT, and XLNet, on SPARTQA-AUTO, and test their accuracy in three ways: Seen and Unseen are both from SPARTQA-AUTO, where Unseen has applied minor modifications to its vocabulary; to get those Human columns, all models are fine-tuned on SPARTQA-HUMAN’s training data. Human performance on Seen and Unseen is the same since the changes applied to Unseen do not affect human reasoning.

All models perform worst in YN across all Q-TYPEs, which suggests that YN presents more complex phenomena, probably due to additional quantifiers in the questions. XLNet performs the best on all Q-TYPEs except its accuracy on SPARTQA-HUMAN’s YN section. However, the drops in Unseen and Human suggest overfitting on the training vocabulary. The low accuracies on the human test set from all models show that solving this benchmark is still a challenging problem and requires more sophisticated methods like considering spatial roles and relations extraction (Kordjamshidi et al., 2010; Dan et al., 2020; Rahgooy et al., 2018) to understand stories and questions better.
To evaluate the reliability of the models, we also provide two extra consistency and contrast test sets. The consistency set is made by changing a part of the question in a way that seeks the same information (Hudson and Manning, 2019; Suhr et al., 2019). Given a pivot question and answer of a specific consistency set, answering other questions in the set does not need extra reasoning over the story. The contrast set is made by a minimal modification in a question to change its answer (Gardner et al., 2020). For contrast sets, there is a need to go back to the story to find the new answer for the question’s minor variations (see Appendix C.2 for examples). The consistency and contrast sets are evaluated only on the correctly predicted questions to check if actual understanding and reasoning occur. This ensures the reliability of the models.
Table 5 shows the result of this evaluation on four Q-TYPEs of SPARTQA-AUTO, where we can see, once again, that the high scores on the Seen test set are likely due to overfitting on training data rather than correct detection of spatial terms and reasoning over them.

6.3 Extrinsic evaluation
In this subsection, we take BERT as an example to show that, once pretrained on SPARTQA-AUTO, BERT can achieve better performance on two extrinsic evaluation datasets, namely bAbI and boolQ.
We draw the learning curve on bAbI, using the original BERT as a baseline and BERT further pretrained on SPARTQA-AUTO (Fig. 6). Although both systems achieve perfect accuracy given large enough training data (i.e., 5k and 10k), BERT (SPARTQA-AUTO) is showing better scores given less training data. Specifically, to achieve an accuracy of 99%, BERT (SPARTQA-AUTO) requires

FB Consistency
69.44 84.77 85.2

FR Consistency
76.13 82.42 88.56

Contrast 42.47 41.69 50

CO Consistency
16.99 58.42 71.10

Contrast 15.58 62.51 72.31

YN Consistency
48.07 48.78 51.08

Contrast 71.41 69.19 69.18

Table 5: Evaluation of consistency and semantic sensitivity of models in Table 4. All the results are on the correctly predicted questions of Seen test set of SPARTQA-AUTO.

GLUE (Wang et al., 2018).

We observe that many of the boolQ examples

answered correctly by the BERT further pretrained

on SPARTQA-AUTO require multi-step reasoning.

Our hypothesis is that since solving SPARTQA-

AUTO questions needs multi-step reasoning, fine-

tuning BERT on SPARTQA-AUTO generally im-

Figure 6: Learning curve of BERT and BERT further proves this capability of the base model.

pretrained on SPARTQA-AUTO on bAbI.

7 Conclusion

Model Majority baseline Recurrent model (ReM) ReM fine-tuned on SQuAD ReM fine-tuned on QNLI ReM fine-tuned on NQ BERT (our setup) BERT (SPARTQA-AUTO)

Accuracy 62.2 62.2 69.8 71.4 72.8 71.9 74.2

Spatial reasoning is an important problem in natural language understanding. We propose the first human-created QA benchmark on spatial reasoning, and experiments show that state-of-the-art pretrained language models (LM) do not have the capability to solve this task given limited training data, while humans can solve those spatial reasoning questions reliably. To improve LMs’ capability on

Table 6: System performances on the dev set of boolQ (since the test set is not available to us). Top: numbers

this task, we propose to use hand-crafted grammar and spatial reasoning rules to automatically gener-

reported in (Clark et al., 2019). Bottom: numbers from our experiments. BERT (SPARTQA-AUTO): further pretraining BERT on SPARTQA-AUTO as a QA task.

ate a large corpus of spatial descriptions and corresponding question-answer annotations; further pretraining LMs on this distant supervision dataset

significantly enhances their spatial language un-

1k training examples, while BERT requires twice as much. We also notice that BERT (SPARTQAAUTO) converges faster in our experiments.

derstanding and reasoning. We also show that a spatially-improved LM can have better results on two extrinsic datasets (bAbI and boolQ).

As another evaluation dataset, we chose boolQ for two reasons. First, we needed a QA dataset


with Yes/No questions. To our knowledge boolQ is the only available one used in the recent work. Second, indeed, SPARTQA and boolQ are from different domains, however, boolQ needs multi-step reasoning in which we wanted to see if SPARTQA helps.
Table 6 shows that further pretraining BERT on SPARTQA-AUTO yields a better result than the

This project is supported by National Science Foundation (NSF) CAREER award #2028626 and (partially) supported by the Office of Naval Research grant #N00014-20-1-2005. We thank the reviewers for their helpful comments to improve this paper and Timothy Moran for his help in the human data generation.

original BERT and those reported numbers in Clark et al. (2019), which also tested on various distant


supervision signals such as SQuAD (Rajpurkar et al., 2016), Google’s Natural Question dataset NQ (Kwiatkowski et al., 2019), and QNLI from

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.

Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. 2019. TOUCHDOWN: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.

Douglas H Clements and Michael T Battista. 1992. Geometry and spatial reasoning. Handbook of research on mathematics teaching and learning, pages 420–464.

Soham Dan, Parisa Kordjamshidi, Julia Bonn, Archna Bhatia, Zheng Cai, Martha Palmer, and Dan Roth. 2020. From spatial relations to spatial configurations. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5855–5864, Marseille, France. European Language Resources Association.

Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5925–5932.

Surabhi Datta and Kirk Roberts. 2020. A hybrid deep learning approach for spatial trigger extraction from radiology reports. In Proceedings of the Third International Workshop on Spatial Language Understanding, pages 50–55, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Kaustubh Dhole and Christopher D. Manning. 2020. Syn-QG: Syntactic and shallow semantic rules for question generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 752–765.

Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to Ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. Question Answering is a Format; when is it useful? ArXiv, abs/1909.11291.

Mehdi Ghanimifard and Simon Dobnik. 2017. Learning to compose spatial relations with grounded neural language models. In IWCS 2017 - 12th International Conference on Computational Semantics - Long papers.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.

Hangfeng He, Qiang Ning, and Dan Roth. 2020a. QuASE: Question-answer driven sentence encoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8743–8758, Online. Association for Computational Linguistics.

