How heavy is an elephant? How expensive is a wedding ring?

Humans have a pretty good sense of *scale*, or the reasonable ranges of these *numeric attributes* of different objects, but do pre-trained language representations? Although pre-trained Language Models (LMs) like BERT have shown a remarkable ability to learn all kinds of knowledge, including factual knowledge, it remains unclear whether their representations can capture these kinds of numeric attributes from text alone, without explicit training data.

In our recent paper, we measure the amount of scale information captured in several kinds of pre-trained text representations and show that, although a **significant amount** of such information is generally captured, there is still a **large gap** between current performance and the theoretical upper bound. We find that the text representations that capture scale best are those that are **contextual** and **good at numerical reasoning**. We also introduce a **new version of BERT**, called *NumBERT*, with improved numerical reasoning obtained by **replacing numbers in the pretraining text corpus with their scientific notation**, which more readily exposes the magnitude to the model, and we demonstrate that NumBERT representations capture scale significantly better than all the previous text representations.

# Scalar Probing

In order to understand to what extent pre-trained text representations, like BERT representations, capture scale information, we propose a task called *scalar probing*: probing the ability to predict a *distribution* over values of a scalar attribute of an object. In this work, we focus specifically on three kinds of scalar attributes: weight, length, and price.

Here is the basic architecture of our scalar probing task:

In this example, we are trying to see whether the representation of “dog” extracted by a pre-trained encoder can be used to predict/recover the distribution of the weight of a dog through a linear model. We probe three baseline language representations: Word2Vec, ELMo, and BERT. Since the latter two are contextual representations that operate on sentences instead of words, we feed in sentences constructed using fixed templates. For example, for weight, we use the template “The X is heavy”, where X is the object of interest.
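To make this concrete, here is a minimal sketch of how such template sentences could be built; the weight template is the one quoted above, while the length and price templates are our own illustrative assumptions:

```python
# Attribute -> template mapping. "The X is heavy" comes from the text
# above; the other two templates are illustrative assumptions.
TEMPLATES = {
    "weight": "The {} is heavy.",
    "length": "The {} is long.",
    "price": "The {} is expensive.",
}

def make_probe_sentence(obj: str, attribute: str) -> str:
    """Fill the fixed template for an attribute with the object of interest."""
    return TEMPLATES[attribute].format(obj)

print(make_probe_sentence("dog", "weight"))  # -> "The dog is heavy."
```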

We explore two kinds of probes: one that predicts a *point estimate* of the value and one that predicts the *full distribution*. For the point estimate, we use a standard linear **R**e**GR**ession (denoted “**rgr**”) trained to predict, for each object, the log of the median of all values for the scale attribute under consideration. We predict the log because, again, we care about the general scale rather than the exact value. The loss is computed between the prediction and the log of the median of the ground-truth distribution. For the full distribution, we use a linear softmax **M**ulti-**C**lass **C**lassifier (denoted “**mcc**”) producing a categorical distribution over 12 orders of magnitude. The categorical distribution predicted using the NumBERT representations (our improved version of BERT, introduced below) is shown as the orange histogram in the above example.
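In code, both probes are just linear heads over a frozen representation vector. Here is a minimal PyTorch-style sketch; the hidden size, batch shapes, and variable names are our own assumptions, not the paper’s exact setup:

```python
import torch
import torch.nn as nn

HIDDEN = 768       # e.g. BERT-base hidden size (an assumption)
NUM_BUCKETS = 12   # categorical distribution over 12 orders of magnitude

# "rgr": linear regression predicting the log of the median value.
rgr_probe = nn.Linear(HIDDEN, 1)
rgr_loss_fn = nn.MSELoss()

# "mcc": linear softmax classifier over the 12 buckets.
mcc_probe = nn.Linear(HIDDEN, NUM_BUCKETS)

def mcc_loss_fn(logits, target_dist):
    # Cross-entropy against the full ground-truth categorical distribution.
    return -(target_dist * torch.log_softmax(logits, dim=-1)).sum(-1).mean()

# rep: frozen encoder output for the object (e.g. "dog" in "The dog is
# heavy."), shape [batch, HIDDEN]; random here as a stand-in.
rep = torch.randn(4, HIDDEN)
log_median = torch.randn(4, 1)                           # log of DoQ medians
target = torch.softmax(torch.randn(4, NUM_BUCKETS), -1)  # DoQ distributions

loss = rgr_loss_fn(rgr_probe(rep), log_median) + mcc_loss_fn(mcc_probe(rep), target)
```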

The ground-truth distributions we use come from the Distributions over Quantities (DoQ) dataset, which consists of *empirical counts* of scalar attribute values associated with >350K nouns, adjectives, and verbs over 10 different attributes, *automatically extracted* from a large web text corpus. Note that during the construction of the dataset, all units for a given attribute are first unified to a single one (e.g. centimeter/meter/kilometer -> meter) and the numeric values are scaled accordingly. We convert the collected counts for each object-attribute pair in DoQ into a *categorical distribution over 12 orders of magnitude*. In the above example of the weight of a dog, the ground-truth distribution is shown as the grey histogram, which is concentrated around 10-100 kg.
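Concretely, converting DoQ’s counts into this representation amounts to bucketing every observed value by its order of magnitude. A rough sketch, where the bucket offset and clipping range are our own assumptions:

```python
import math
from collections import Counter

NUM_BUCKETS = 12

def to_bucket_distribution(value_counts: Counter) -> list:
    """Turn empirical (value -> count) pairs into a categorical
    distribution over 12 order-of-magnitude buckets."""
    buckets = [0.0] * NUM_BUCKETS
    for value, count in value_counts.items():
        # Order of magnitude, shifted and clipped into [0, 11]; the
        # offset of 3 is an illustrative assumption.
        b = min(max(int(math.floor(math.log10(value))) + 3, 0), NUM_BUCKETS - 1)
        buckets[b] += count
    total = sum(buckets)
    return [c / total for c in buckets]

# e.g. observed weights of "dog" in kg, concentrated around 10-100 kg:
print(to_bucket_distribution(Counter({8.0: 2, 25.0: 5, 40.0: 3})))
```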

**The better the predictive performance is across all the object-attribute
pairs we are dealing with, the better the pre-trained representations
encode the corresponding scale information.**

# NumBERT

Before looking at the scalar probing results for these different language representations, let’s think about what kind of representations might be good at capturing scale information, and how to improve existing LMs to capture scale better. All of these models are trained on large online text corpora like Wikipedia and news. How can their representations pick up scale information from all this text?

Here is a piece of text from the first document I got when I searched Google for “elephant weight”:

“…African elephants can range from 5,000 pounds to more than 14,000 pounds (6,350 kilograms)…”

So it is highly likely that **the learning of scale is partly mediated by the transfer of scale information from the numbers** (here “5,000”, “14,000”, etc.) **to nouns** (here “elephants”), and that **numeracy**, i.e. the ability to reason about numbers, **is probably important for representing scale**!

However, previous work has shown that existing pre-trained text representations, including BERT, ELMo, and Word2Vec, are not good at reasoning over numbers. For example, beyond a magnitude of ~500, they cannot even decode a number from its word embedding, e.g. recover 710 from embedding(“710”). Thus, we propose to improve the numerical reasoning abilities of these representations by replacing every instance of a number in the LM training data with its *scientific notation*, and re-pretraining BERT (which we call *NumBERT*). This enables the model to more easily associate objects in the sentence directly with the *magnitude* expressed in the *exponent*, ignoring the relatively insignificant mantissa.
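As a rough illustration of this preprocessing step (the paper’s exact number normalization may differ; this regex-based version is our own assumption):

```python
import re

NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def to_scientific(match):
    value = float(match.group(0).replace(",", ""))
    mantissa, exponent = f"{value:e}".split("e")
    # Keep a short mantissa; the exponent carries the magnitude.
    return f"{float(mantissa):.4g}e{int(exponent)}"

def numbertify(text: str) -> str:
    """Replace every number in the text with its scientific notation."""
    return NUMBER_RE.sub(to_scientific, text)

print(numbertify("African elephants can range from 5,000 pounds "
                 "to more than 14,000 pounds (6,350 kilograms)"))
# -> "... from 5e3 pounds to more than 1.4e4 pounds (6.35e3 kilograms)"
```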

# Results

### Scalar Probing

The above table shows the results of scalar probing on the DoQ data. We use three evaluation metrics: *Accuracy*, *Mean Squared Error (MSE)*, and *Earth Mover’s Distance (EMD)*, and we run experiments in four domains: *Lengths*, *Masses*, *Prices*, and *Animal Masses* (a subset of Masses). For MSE and EMD, the best possible score is 0, while for accuracy we compute a loose *upper bound* by sampling from the ground-truth distribution and evaluating against its mode. This upper bound achieves accuracies of 0.570 for lengths, 0.537 for masses, and 0.476 for prices.
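One way to read this upper bound, sketched below under our own assumptions: sample buckets from each object’s ground-truth distribution, score them against that distribution’s mode, and average over objects:

```python
import numpy as np

def sampling_upper_bound(dists, n_samples=1000):
    """Accuracy of samples drawn from each ground-truth distribution,
    evaluated against that distribution's mode, averaged over objects."""
    rng = np.random.default_rng(0)
    accs = []
    for p in dists:  # p: categorical distribution over 12 buckets
        mode = int(np.argmax(p))
        samples = rng.choice(len(p), size=n_samples, p=p)
        accs.append(float(np.mean(samples == mode)))
    return float(np.mean(accs))
```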

For the *Aggregate* baseline, for each attribute we compute the empirical distribution over buckets across all objects in the training set and use it as the predicted distribution for all objects in the test set. Compared with this baseline, we can see that the **mcc** probe over the best text representations captures about **half** (as measured by accuracy) to **a third** (by MSE and EMD) of the distance to the upper bound mentioned above, suggesting that **while a significant amount of scalar information is available, there is a long way to go to support robust commonsense reasoning**.
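For reference, the Aggregate baseline can be sketched as follows; pooling by averaging the per-object bucket distributions is our reading of “empirical distribution over buckets across all objects”:

```python
import numpy as np

def aggregate_baseline(train_dists):
    """Average the bucket distributions of all training objects and use
    the result as the single prediction for every test object."""
    agg = np.asarray(train_dists).mean(axis=0)
    return agg / agg.sum()  # renormalize against rounding drift
```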

Specifically, **NumBERT representations do consistently better than all the others** on *Earth Mover’s Distance* (EMD), which is the *most robust* metric because of its better convergence properties and robustness to adversarial perturbations of the data distribution. **Word2Vec** performs significantly worse than the contextual representations, even though the task is *noncontextual* (since we do not have different ground truths for an object occurring in different contexts in our setting). Also, despite being weaker than BERT on downstream NLP tasks, **ELMo does better on scalar probing**, consistent with it being better at numeracy due to its *character-level tokenization*.

### Zero-shot transfer

We note that DoQ is derived heuristically from web text and contains noise, so we also evaluate probes trained on DoQ on two datasets containing *ground-truth labels* of scalar attributes: VerbPhysics and the Amazon Price Dataset. The first is a human-labeled dataset of relative comparisons, e.g. (person, fox, weight, bigger). Predictions for this task are made by comparing the point estimates for **rgr** and the highest-scoring buckets for **mcc**. The second is a dataset of empirical distributions of product prices on Amazon. We retrained a probe on DoQ prices using 12 power-of-4 buckets to support finer-grained predictions.
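As a hedged sketch of the comparison step on VerbPhysics using rgr point estimates (the tie margin and label strings are our own assumptions):

```python
def compare_attribute(pred_a, pred_b, margin=0.1):
    """3-way comparison from rgr point estimates (log-scale values).
    The margin used to call a tie is an illustrative assumption."""
    if pred_a - pred_b > margin:
        return "bigger"
    if pred_b - pred_a > margin:
        return "smaller"
    return "similar"

# e.g. hypothetical rgr log-median weight predictions for (person, fox):
print(compare_attribute(1.85, 0.75))  # -> "bigger"
```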

The results are shown in the tables above. On VerbPhysics (the top table), **rgr**+NumBERT performed best, approaching the performance of using DoQ as an oracle, though falling short of specialized models for this task. Scalar probes trained with **mcc** perform poorly, possibly because a finer-grained model of the predicted distribution is not useful for the 3-class comparative task. On the Amazon Price Dataset (the bottom table), which is a full distribution prediction task, **mcc**+NumBERT did best on both distributional metrics. On both zero-shot transfer tasks, **NumBERT representations were the best** across all configurations of metrics/objectives, suggesting that manipulating the numeric representations in the pre-training corpora can significantly improve performance on scale prediction.

# Moving Forward

In the work above, we introduce a new task called *scalar probing*, used to measure how much information about the numeric attributes of objects pre-trained text representations have captured. We find that while there is a **significant amount of scale information** in object representations (half to a third of the distance to the theoretical upper bound), these models are **far from achieving common-sense scale understanding**. We also introduce an **improved version of BERT**, called *NumBERT*, whose representations **capture scale information significantly better** than all the previous ones.

Scalar probing opens up exciting new research directions. For example, much recent work has pre-trained large-scale *vision & language models*, like ViLBERT and CLIP. Probing their representations to see how much scale information they capture, and systematically comparing them against representations learned by language-only models, could be quite interesting.

Also, models learning text representations that predict scale better can have a **great real-world impact**. Consider a web query like:

“How tall is the tallest building in the world?”

With a common-sense understanding of what a reasonable range of heights for “building” is, we can detect errors in the current web QA system when there are mistakes in retrieval or parsing, e.g. when a Wikipedia sentence about a building is mistakenly parsed as saying it is 19 miles high instead of 19 meters.
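Such a sanity check could be as simple as flagging an extracted answer whose order of magnitude lands in a low-probability bucket of the probe’s predicted distribution. A hypothetical sketch, reusing the bucketing from the earlier snippet (the example distribution and threshold are made up):

```python
import math

def is_plausible(value_m, bucket_probs, threshold=0.05):
    """Return False if the answer's order-of-magnitude bucket has
    probability below `threshold` under the predicted distribution."""
    b = min(max(int(math.floor(math.log10(value_m))) + 3, 0),
            len(bucket_probs) - 1)
    return bucket_probs[b] >= threshold

# Hypothetical predicted height distribution for "building" (12 buckets):
building = [0.0, 0.0, 0.0, 0.05, 0.25, 0.45, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
print(is_plausible(828.0, building))    # 828 m (Burj Khalifa)  -> True
print(is_plausible(30578.0, building))  # 19 miles ≈ 30,578 m   -> False
```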

Check out the paper Do Language Embeddings Capture Scales? by Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth.