What Matters in Memorizing and Recalling Facts? Multifaceted Benchmarks for Knowledge Probing in Language Models (2025)


Xin Zhao
The University of Tokyo
xzhao@tkl.iis.u-tokyo.ac.jp

Naoki Yoshinaga, Daisuke Oba*
Institute of Industrial Science, The University of Tokyo
{ynaga,oba}@iis.u-tokyo.ac.jp

*Daisuke Oba currently works for ELYZA, Inc.

Abstract

Language models often struggle with handling factual knowledge, exhibiting factual hallucination issues. This makes it vital to evaluate a model's ability to recall its parametric knowledge about facts. In this study, we introduce a knowledge probing benchmark, BELIEF(-ICL), to evaluate the knowledge recall ability of both encoder- and decoder-based pre-trained language models (PLMs) from diverse perspectives. BELIEFs utilize a multi-prompt dataset to evaluate PLMs' accuracy, consistency, and reliability in factual knowledge recall. To enable a more reliable evaluation with BELIEFs, we semi-automatically create MyriadLAMA, which has massively diverse prompts. We validate the effectiveness of BELIEFs in comprehensively evaluating the knowledge recall ability of diverse PLMs, including recent large language models (LLMs). We then investigate key factors in memorizing and recalling facts in PLMs, such as model size, pretraining strategy and corpora, the instruction-tuning process, and in-context learning settings. Finally, we reveal the limitations of prompt-based knowledge probing. MyriadLAMA is publicly available at https://huggingface.co/datasets/iszhaoxin/MyriadLAMA.



1 Introduction

One of the strongest motivations for training a language model (LM) on massive text is to increase its ability to handle factual knowledge (Kamalloo et al., 2023). However, even when LMs are trained on massive text, they suffer from hallucinations, generating sentences grounded in incorrect knowledge (Zhang et al., 2023). Considering that large LMs (LLMs) are being widely applied to real-world tasks, it is vital to evaluate their ability to recall parametric knowledge and to understand what factors influence the memorization of facts during pre-training.

However, evaluating an LLM's knowledge recall ability is still challenging. Although the LAMA probe (Petroni et al., 2019) evaluates the knowledge stored in pre-trained LMs (PLMs), it provides only prediction accuracy. Some studies diversify prompts in the LAMA probe to compute prediction consistency (robustness) (Elazar et al., 2021; Jiang et al., 2020), but those datasets suffer from either low quality or low quantity (§B). Moreover, since the LAMA probe assumes encoder-based PLMs with the masked LM objective to solve fill-in-the-blank tasks, directly applying it to decoder-based LLMs will underestimate their knowledge recall ability. Although recent studies leveraged QA datasets to probe LLMs' knowledge (Kalo and Fichtel, 2022; Mallen et al., 2023; Wiland et al., 2024; Maekawa et al., 2024), they overlook important aspects other than prediction accuracy, such as robustness to diverse prompts and the reliability of predictions, which matter for real-world applications.

[Figure 1: Overview of the BELIEF benchmarks.]

In this study, we introduce a multifaceted benchmark for knowledge probing, BELIEFs (Figure 1), including BELIEF (§2) and BELIEF-ICL (§3) for encoder- and decoder-based PLMs. BELIEFs utilize diverse prompts for each fact to account for the impact of linguistic expressions when evaluating LLMs' knowledge recall ability. This allows us to evaluate the robustness and reliability of LLM knowledge by measuring fluctuations in accuracy, consistency, and overconfidence in fact prediction. Since BELIEFs require a multi-prompt probing dataset with diverse prompts for each fact, we build a new probing dataset, MyriadLAMA, to enable a more accurate and comprehensive evaluation (§4). MyriadLAMA expands LAMA-UHN (Petroni et al., 2020) by offering different prompts for each fact through a semi-automatic method. Specifically, we obtain a wide variety of lexically, syntactically, and semantically diverse prompts by rewriting relational templates and extending subject expressions.

We applied BELIEFs to various encoder- and decoder-based PLMs, including BERT (Devlin et al., 2019) and Llama3 (Dubey et al., 2024) (§5.1). Through extensive evaluations, we verify the utility of BELIEFs in uncovering PLMs' knowledge recall ability (§5). Moreover, by comparing different PLMs, we gain insights into the factors affecting knowledge recall of PLMs from three aspects: accuracy, reliability, and robustness (§6).

The primary findings in this study are as follows:

  • Model size, pretraining strategy, and corpora are crucial factors for memorizing knowledge in LMs during pretraining.

  • Whereas instruction-tuning enhances LLMs’ ability to follow instructions in BELIEF-ICL, it reduces their knowledge recall ability.

  • The inclusion and selection of demonstrations impact knowledge recall, revealing the gap between memorized and recallable facts.

  • Exploring the upper limits of covered knowledge by various methods reveals the limitation of prompt-based knowledge probing (§7).

2 BELIEF Benchmark

We first present the multifaceted factual probing benchmark BELIEF for encoder-based PLMs. Using a multi-prompt probing dataset, BELIEF evaluates the knowledge recall ability of PLMs in terms of accuracy, robustness, and reliability (§2.2-2.4). Here, robustness measures PLMs' ability to maintain consistent accuracy and predictions when given different prompts in evaluation. Reliability reflects the extent to which we can trust the PLMs' predictions.

2.1 Preliminaries

To evaluate the facts in PLMs, BELIEF aggregates results from multiple prompts for each fact to mitigate biases from specific linguistic expressions. This requires varied expressions for each fact, namely a multi-prompt factual probing dataset.

We assume the fill-in-the-blank setting, where each fact is represented as a knowledge triple ⟨subject, relation, object⟩ (e.g., ⟨Tokyo, Capital, Japan⟩). To probe PLMs for a knowledge triple, we first create a masked prompt (hereafter, prompt; e.g., "Tokyo is the capital of [MASK]") and then input it into PLMs to see if they correctly predict the object token. To create such prompts, we first need a template for the relation (hereafter, relational template; e.g., "[X] is the capital of [Y]"). We then fill the template with the target knowledge triple, replacing [X] with a subject expression and [Y] with a [MASK] token. A multi-prompt dataset offers diverse prompts for each fact by providing varied relational templates and entity expressions.

We denote the set of subject-relation pairs in the dataset as $T$, and the set of prompts for a given subject-relation pair $t \in T$ as $P_t$. If the output distribution over the mask token of a prompt is $\mathcal{O} = \{(w_j, o_j) \mid \sum_j o_j = 1\}$, the prediction result is defined as the token $\hat{w} = \operatorname{argmax}_{w_j:\,(w_j, o_j) \in \mathcal{O}} o_j$.
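As a concrete illustration, the fill-in-the-blank probe can be run with an off-the-shelf masked LM. The sketch below uses the Hugging Face transformers fill-mask pipeline; the model name and prompt are illustrative placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of fill-in-the-blank probing for an encoder-based PLM.
# The model and prompt are illustrative placeholders.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

prompt = "Tokyo is the capital of [MASK]."    # derived from "[X] is the capital of [Y]"
predictions = fill_mask(prompt, top_k=5)      # top of the output distribution O

w_hat = predictions[0]["token_str"]           # argmax token \hat{w}
o_max = predictions[0]["score"]               # maximum probability, used later as confidence
print(w_hat, o_max)
```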

2.2 Accuracy and its fluctuations

To correctly evaluate the accuracy of PLMs, we aggregate predictions from diverse prompts. Specifically, we randomly select one prompt for each subject-relation pair $t \in T$ to form a set of prompts covering all triples, $P = \{p_1, \ldots, p_{|T|}\}$. By feeding these prompts $P$ to PLMs, we can calculate one accuracy value based on their predictions. We repeat this process to collect a set of accuracies, which we then use to calculate both the average and the fluctuation.

Average accuracy: In BELIEF, accuracy metrics include Acc@1, which measures the rate of prompts whose correct token is predicted within the top-1 output probabilities. We repeat the sampling process $N$ times to obtain a set of accuracies, denoted as $V_{\mathrm{Acc@1}}$, where $|V_{\mathrm{Acc@1}}| = N$. The final average accuracy is calculated as the mean value of $V_{\mathrm{Acc@1}}$.

Fluctuation of accuracy: For $V_{\mathrm{Acc@1}}$, we evaluate accuracy fluctuations using the range and standard deviation (SD). The range is the difference between the maximum and minimum accuracy values in $V_{\mathrm{Acc@1}}$.
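A minimal sketch of this sampling procedure is shown below. It assumes a hypothetical precomputed table `is_correct[t]` recording whether each prompt of subject-relation pair t yields the correct top-1 prediction; only the aggregation logic follows the description above.

```python
# Sketch of average Acc@1 and its fluctuation (range, SD) under BELIEF.
# `is_correct[t]` is a hypothetical list of booleans, one per prompt of pair t.
import random
import numpy as np

def sample_acc_at_1(is_correct, rng):
    # pick one prompt per subject-relation pair and score the resulting prompt set P
    hits = [prompts[rng.randrange(len(prompts))] for prompts in is_correct.values()]
    return sum(hits) / len(hits)

def accuracy_stats(is_correct, n_samples=50_000, seed=0):
    rng = random.Random(seed)
    accs = np.array([sample_acc_at_1(is_correct, rng) for _ in range(n_samples)])
    return accs.mean(), accs.max() - accs.min(), accs.std()   # average, range, SD

# toy usage: two subject-relation pairs, each with a few prompts
toy = {"t1": [True, False, True], "t2": [False, False, True]}
print(accuracy_stats(toy, n_samples=1000))
```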

2.3 Consistency

For each subject-relation pair $t$, we assess the PLM's consistency in predicting the object across different prompts in $P_t$. Specifically, we compute the degree of match between the prediction result $\hat{w}_t^i$ for a given prompt $p_t^i$ and the prediction results $\hat{w}_t^j$ for the other prompts $p_t^j \in P_t$ ($j \neq i$), averaged across all subject-relation pairs in $T$:

$\mathrm{Consist} = \frac{1}{|T|} \sum_{t \in T} \frac{\sum_{i,j:\, i \neq j,\; i,j \le |P_t|} \mathbb{1}[\hat{w}^t_i = \hat{w}^t_j]}{\frac{1}{2}|P_t|(|P_t|-1)}$    (1)
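Equation (1) is the pairwise agreement rate of top-1 predictions, averaged over subject-relation pairs. A direct sketch, with `preds[t]` a hypothetical list of predicted tokens (one per prompt in $P_t$):

```python
# Sketch of the Consist metric (Eq. 1): average pairwise agreement of
# top-1 predictions across the prompts of each subject-relation pair.
from itertools import combinations

def consistency(preds):
    scores = []
    for tokens in preds.values():
        pairs = list(combinations(range(len(tokens)), 2))
        if not pairs:                 # a single prompt has no pairs to compare
            continue
        agree = sum(tokens[i] == tokens[j] for i, j in pairs)
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores)

print(consistency({"t1": ["Japan", "Japan", "Tokyo"], "t2": ["guitar", "piano"]}))
```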

2.4 Reliability

The reliability of PLMs reflects the extent to which we can trust the predictions they provide. In our study, we measure PLMs' overconfidence in making fact predictions, drawing on the expected calibration error metric (Desai and Durrett, 2020). Specifically, we measure the difference between the true prediction accuracy and the model's confidence in its predicted tokens. For each prompt, we first acquire the maximum probability (hereafter, confidence) from the output distribution for the mask token. Subsequently, all prompts are arranged in descending order of confidence and segmented into $M$ bins ($P^{(1)}, P^{(2)}, \ldots, P^{(M)}$), with the same number of data points in each bin. For each bin $i$, we compute the average accuracy $\overline{\mathrm{Acc@1}}^{(i)}$ and the average confidence $\overline{o_{max}}^{(i)}$. We use $M = 10$ in all experiments. Finally, the PLM's overconfidence in predicting the object is assessed by averaging the differences between average confidence and accuracy across all bins:

$\mathrm{Ovconf} = \sum_{i=1}^{M} \frac{|P^{(i)}|}{|P|} \left( \overline{o_{max}}^{(i)} - \overline{\mathrm{Acc@1}}^{(i)} \right)$    (2)

where $|P| = \sum_{i=1}^{M} |P^{(i)}|$ is the total number of prompts.

The closer the Ovconf is to zero, the more aligned the model’s confidence is with its accuracy, indicating reliable confidence. A negative Ovconf value means the model is underconfident.
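A minimal sketch of Eq. (2) is given below. It assumes lists of per-prompt confidences and correctness flags; because the bins are equal-sized, the weighted sum reduces to a simple average of the per-bin confidence-accuracy gaps.

```python
# Sketch of the Ovconf metric (Eq. 2): sort prompts by confidence, split them
# into M equal-size bins, and average (confidence - accuracy) over the bins.
# `confidences` and `correct` are hypothetical per-prompt inputs.
import numpy as np

def overconfidence(confidences, correct, m_bins=10):
    order = np.argsort(confidences)[::-1]                 # descending confidence
    conf = np.asarray(confidences, float)[order]
    hit = np.asarray(correct, float)[order]
    bins = np.array_split(np.arange(len(conf)), m_bins)
    gaps = [conf[idx].mean() - hit[idx].mean() for idx in bins if len(idx)]
    return float(np.mean(gaps))                           # >0 overconfident, <0 underconfident

rng = np.random.default_rng(0)
c = rng.uniform(size=1000)
print(overconfidence(c, rng.uniform(size=1000) < c))      # roughly calibrated toy data
```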

3 BELIEF-ICL for Decoder-based LLMs

Recent LLMs are based on the decoder-only Transformer architecture and are trained to predict subsequent tokens in a sequence. This makes it challenging for them to directly predict [MASK] tokens in masked prompts, as they cannot utilize information following the [MASK] (e.g., "[MASK] and Tokyo are twin cities"). To comprehensively evaluate LLMs and enable a fair comparison between encoder- and decoder-based models, we extend BELIEF to LLMs by employing in-context learning (ICL), termed BELIEF-ICL.

3.1 In-context learning for fact probe

In-context learning allows LLMs to perform complex tasks at inference time using task-specific prompts (Brown et al., 2020). When designing ICL for evaluating factual knowledge, it is essential to consider the task instruction and the context examples appended to the target prompts.

1) Task instruction: We introduce the mask prediction (MP) instruction, which prompts LLMs to generate a one-word answer for the target masked prompt. The task instruction is formulated as "Predict the [MASK] in each sentence in one word."

2) Context settings: We propose four types of contexts to assess the impact of exemplar selection on factual knowledge probing, following the QA format outlined in InstructGPT (Ouyang et al., 2022): zero-shot uses only the instruction; X-random samples X facts from all relations as few-shot demonstrations; X-relation samples X facts from the same relation but with random templates; X-template samples X facts from the same relation and the same template.

In the few-shot settings, we ensure that the target fact is excluded from the examples. Refer to §E for examples of prompts.
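The sketch below shows how a BELIEF-ICL input could be assembled from the MP instruction, a few demonstrations, and the target masked prompt. The "->" answer format and the demonstration layout are simplifying assumptions, not the paper's exact prompt template (see §E for the actual examples).

```python
# Sketch of assembling a BELIEF-ICL input: MP instruction, few-shot
# demonstrations (here in the 4-template style), and the target prompt.
# The line format is an assumption; see Appendix E for the real prompts.
INSTRUCTION = "Predict the [MASK] in each sentence in one word."

def build_icl_prompt(demonstrations, target_prompt):
    # demonstrations: list of (masked_sentence, answer) pairs, target fact excluded
    lines = [INSTRUCTION, ""]
    for sentence, answer in demonstrations:
        lines.append(f"{sentence} -> {answer}")
    lines.append(f"{target_prompt} ->")
    return "\n".join(lines)

demos = [("Paris is the capital of [MASK].", "France"),
         ("Ottawa is the capital of [MASK].", "Canada")]
print(build_icl_prompt(demos, "Tokyo is the capital of [MASK]."))
```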

3.2 Evaluation methods

Since LLMs generate responses without a token limit, matching the correct answer with the model’s output can be challenging.Variations in language expressions, such as the presence or absence of articles and singular or plural forms, complicate this process.Additionally, the model may generate extra tokens not relevant to the [MASK] token, such as parts of the prompt.For example, for the prompt “John Lennon can play [MASK],” both “guitars” and “a guitar” should be considered correct.

To measure the BELIEF metrics for LLMs, we compare two strings: the generated text and the correct object expression for Acc@1, and two generated texts for Consist and Ovconf. We first normalize strings by tokenizing and lemmatizing them; for example, "a guitar" and "guitars" are normalized to "a, guitar" and "guitar." If one normalized token list is included in the other (partial matching), the two strings are considered matched.
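A minimal sketch of this normalization and of the one-/bi-directional matching used below; the regex tokenizer and WordNet lemmatizer are illustrative choices, since the paper does not name the exact tools.

```python
# Sketch of answer matching for BELIEF-ICL: normalize both strings by
# tokenizing and lemmatizing, then test containment of one token list in
# the other. Tokenizer and lemmatizer choices are assumptions.
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def normalize(text):
    return [lemmatizer.lemmatize(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]

def contains(inner, outer):
    return all(tok in outer for tok in inner)

def one_way_match(answer, generated):     # Acc@1: the answer must appear in the generation
    return contains(normalize(answer), normalize(generated))

def two_way_match(text_a, text_b):        # Consist/Ovconf: either direction counts
    a, b = normalize(text_a), normalize(text_b)
    return contains(a, b) or contains(b, a)

print(one_way_match("guitars", "John Lennon can play the guitar"))   # True
```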

1) Accuracy and its fluctuations: Accuracy is calculated by comparing the string generated by the model under greedy decoding to the correct answers. Notably, the matching judgment is one-directional: it only checks whether the correct answer is included in the generated string. One-directional matching avoids incorrect judgments when the model generates unrelated words. We use the same $N$ as in §2.2 for the accuracy measurement.

2) Consistency:We use bi-directional matching to evaluate the consistency (Consist) of generated sequences from two prompts.

3) Reliability: To calculate overconfidence, we need the model's confidence (probability) in its output. However, we cannot obtain this directly from the probability over generated tokens, as LLMs can produce diverse outputs that represent the same answer. To address this, we propose an approximate measurement. For each prompt, we generate 100 samples using multinomial sampling, which selects each next token according to the model's probability distribution over the entire vocabulary. We then measure the matching rate between the output generated by greedy decoding and the 100 sampled outputs, and use this matching rate as the confidence value for the prompt. This approximates the Ovconf calculation in BELIEF, which samples answers from the output distribution, making the confidence comparable between BELIEF and BELIEF-ICL. The calculation of Ovconf then follows the same setting as in §2.4. Note that, owing to the high cost of generating 100 samples for each fact, we adopt a more efficient approach: we sample 10K prompts from 10K unique subject-relation pairs and use only these prompts for answer sampling.
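The sampling-based confidence estimate could be sketched as follows with Hugging Face generation utilities; the model id, decoding lengths, and the simplified string matcher are assumptions for illustration.

```python
# Sketch of the sampling-based confidence: compare the greedy answer with
# multinomial samples and use the match rate as the prompt's confidence.
# Model id and generation hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"       # placeholder model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

def simple_match(a, b):
    a, b = a.lower().strip(), b.lower().strip()
    return a in b or b in a               # stand-in for the normalization-based matcher above

def confidence(prompt, n_samples=100, max_new_tokens=8):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    greedy = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    greedy_text = tok.decode(greedy[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    sampled = model.generate(**inputs, do_sample=True, num_return_sequences=n_samples,
                             max_new_tokens=max_new_tokens)
    texts = [tok.decode(s[inputs.input_ids.shape[1]:], skip_special_tokens=True) for s in sampled]
    return sum(simple_match(greedy_text, t) for t in texts) / n_samples
```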

4 MyriadLAMA Dataset

The fairness and accuracy of BELIEF evaluation depend on the diversity and quality of multi-prompt factual probing datasets. However, existing datasets are either manually rewritten in small numbers (Elazar et al., 2021) or mined from texts (Jiang et al., 2020). The former is accurate but lacks diversity, providing an average of 7.3 prompts per fact with limited variation. For example, templates like "[X] works as [Y]" and "[X], who works as [Y]" are provided as different templates but are very similar. Additionally, the number of templates is highly imbalanced: 8 out of 46 relations have only one template, while P138 (https://www.wikidata.org/wiki/Property:P9138) has 20. The latter is diverse but includes templates that do not necessarily imply the relationship. For instance, for relation P937 (work location, https://www.wikidata.org/wiki/Property:P937), the mined templates include "[X] to meet [Y].", which significantly deviates from the original meaning. To achieve a more accurate and fair evaluation, we introduce MyriadLAMA, a new multi-prompt factual probing dataset with improved diversity while retaining quality. Refer to §B for detailed qualitative and quantitative comparisons between MyriadLAMA and prior datasets.

4.1 Dataset construction

We build MyriadLAMA by semi-automatically extending the existing single-prompt probing dataset LAMA-UHN (Petroni et al., 2020). MyriadLAMA generates multiple prompts for each fact by providing multiple equivalent relational templates for each relation and by varying the linguistic expressions of subjects. Additionally, MyriadLAMA offers multiple expressions for each object to cover facts that are correctly predicted but with different tokens. For example, for the query "John Lennon was born in [MASK]", acceptable tokens could include "UK" and "Britain." We follow the setting of LAMA-UHN triples, where the object is a single token according to the BERT tokenizer. During evaluation, we consider a fact to be present if the model's predicted token matches any of the correct tokens, regardless of which correct answer is predicted.

Specifically, we define knowledge triples that neglect the diversity of surface expressions as unique triples and distinguish them from derived triples, which embody the diverse entity expressions and relational templates in each unique triple.The triple extension methods are described below.

Extending entities: The knowledge triples in LAMA-UHN constitute a subset of the Wikipedia knowledge base T-REx (Elsahar et al., 2018). T-REx selectively includes only certain objects for subject-relation pairs. MyriadLAMA extends the unique triples in LAMA-UHN by mining T-REx, using the subject-relation pair as the key, to include other available objects. For example, if LAMA-UHN contains only E_{guitar} for the instruments that "John Lennon" can play, we can extend the unique triple to include E_{piano}. We also extend the entity expressions using aliases obtained from Wikidata (https://www.wikidata.org/wiki/Wikidata:Data_access).

Paraphrasing relational templates: MyriadLAMA creates a great variety of relational templates through a semi-automatic process. First, we manually generate five distinct templates for each relation. They incorporate entailment expressions and diverse syntactic patterns, such as statements and question-answer formats, to provide semantic and syntactic variation. Next, to increase quantity and lexical diversity, we automatically paraphrase each manually created template 19 times using the GPT-4 API (OpenAI: gpt-4-1106-preview). Finally, all templates are filtered by human reviewers to remove low-quality templates, yielding a total of 4,100 templates covering 41 relations.

4.2 Dataset Statistics

Table 1: Statistics of LAMA-UHN and MyriadLAMA.

                          LAMA-UHN    MyriadLAMA
Relational templates      41          4,100
Unique triples            27,106      34,048
Derived triples           27,106      21,140,500
Subject-relation pairs    24,643      24,643
Prompts                   24,643      6,492,800

Table 1 lists the statistics of MyriadLAMA. The number of derived triples increases from 27,106 in LAMA-UHN to 21,140,500 by combining the semi-automatically generated relational templates with the alias expressions of subject and object entities. Because prompts are generated from derived triples without considering object expressions, the number of generated prompts is smaller than the number of derived triples; it increases from 24,643 to 6,492,800. Refer to the appendices for details on dataset construction (§A) and validity analysis of MyriadLAMA (§C). Examples of extended templates are provided in §A.3.

5 Effectiveness of BELIEFs

5.1 Experimental setups

We use BELIEFs to evaluate the knowledge recall abilities of both encoder- and decoder-based PLMs. The target encoder-based PLMs include BERTbase, BERTlarge, and BERTwwm (BERTwwm masks all tokens of a single word at the same time, whereas BERTbase and BERTlarge mask individual sub-word tokens). The target decoder-based LLMs include Llama2 (7B, 13B, and 70B) and Llama3 (8B and 70B), with and without instruction tuning (except for Llama3-70B), along with Phi3 (mini, small, and medium). Their pre-training information is briefly summarized in Table 2. Refer to §F for more details.

Table 2: Pre-training information of the evaluated PLMs.

PLMs (#params)           Corpus size  Corpus source
BERTbase (110M)          3.3B+        English Wikipedia & BookCorpus
BERTlarge (336M)         3.3B+        English Wikipedia & BookCorpus
BERTwwm (336M)           3.3B+        English Wikipedia & BookCorpus
Llama2-7B(-IT) (7B)      2.0T         A collection of publicly available online data
Llama2-13B(-IT) (13B)    2.0T         A collection of publicly available online data
Llama2-70B(-IT) (70B)    2.0T         A collection of publicly available online data
Llama3-8B(-IT) (8B)      15T+         A collection of publicly available online data
Llama3-70B (70B)         15T+         A collection of publicly available online data
Phi3-mini (3.8B)         4.9T         High-quality educational data/code/chat & synthetic textbook-like data
Phi3-small (7B)          4.9T         High-quality educational data/code/chat & synthetic textbook-like data
Phi3-medium (14B)        4.9T         High-quality educational data/code/chat & synthetic textbook-like data

We conduct a full-scale evaluation on LLMs with up to 8 billion parameters. To reduce the cost of LLM inference, we use only the five manually rewritten templates for LLMs with more than 8B parameters, including Llama2-70B and its instruction-tuned variant Llama2-70B-IT, Llama3-70B, and Phi3-medium; this partial evaluation is sufficient to compare performance across model sizes. To calculate the average and fluctuation of accuracy (§2.2), we set a large sample number ($N = 50{,}000$) to obtain stable, accurate results.

In the following sections, we analyze the evaluation results on various PLMs to deepen our understanding of how PLMs learn and represent factual knowledge. All evaluation results, including those for another family of encoder-based models, ALBERT, are presented in §F.3.

Table 3: BELIEF evaluation results on LAMA-UHN (LU) and MyriadLAMA (MyL). Above: encoder-based BERT models; below: Llama3-8B under different ICL settings.

PLMs                   Acc@1 (LU)  Acc@1 (MyL)  Range↓  SD↓    Consist↑  Ovconf
BERTbase               .2403       .1095        .1534   .0217  .1682     .2154
BERTlarge              .2454       .1102        .1574   .0220  .1713     .2052
BERTwwm                .2448       .1364        .1517   .0208  .1524     .1000
Llama3-8B  zero-shot   .3708       .3427        .2864   .0350  .0240     -.1119
Llama3-8B  4-random    .5050       .5205        .2033   .0273  .2156     -.0789
Llama3-8B  4-relation  n/a*        .6871        .1236   .0156  .3659     -.0783
Llama3-8B  4-template  .6490       .7268        .0220   .0026  .4015     -.0582

* X-relation cannot be applied to a single-prompt dataset.

5.2 Do BELIEFs provide additional insights?

BELIEFs offer evaluation from diverse perspectives beyond accuracy. As shown in Table 3 (above), the evaluation highlights accuracy fluctuations among the BERT variants. All BERT models show low consistency and tend to be overconfident in their predictions. Figure 2 (left) depicts the relationship between confidence and Acc@1 for the BERT models, indicating low accuracy even for prompts with confident outputs. Whereas BERTwwm performs better on most BELIEF metrics, BERTlarge outperforms BERTwwm on LAMA-UHN. This discrepancy arises from the limited prompts used in LAMA-UHN and its single-faceted evaluation method. This highlights BELIEF's effectiveness in achieving a more accurate factual probing comparison between PLMs.

[Figure 2: Relationship between confidence and Acc@1.]

5.3 Does ICL adhere to instructions?

We then explore the effectiveness of different ICL settings in extracting facts from LLMs.We evaluate the instruction adherence of these settings from two aspects: predicting facts and generating one-word answers, reflecting that the target objects in MyriadLAMA are primarily one-word entities.

Table 4 shows the Acc@1 and one-word generation ratio of two pretrained LLMs (Llama2-7B and Llama3-8B) and one instruction-tuned LLM (Phi3-small). We find that under few-shot settings, even the pretrained LLMs exhibit a remarkable ability to follow instructions, indicating the effectiveness of prompting LLMs to predict mask tokens through in-context learning. Our evaluation with QA-style ICL settings also confirms this (see §D for details). Moreover, exemplars similar to the target prompt (4-template) in the context improve all metrics (Table 3 below, Table 4).

Table 4: Fact prediction accuracy (Acc@1) and one-word generation ratio under different ICL settings (Llama2-7B / Llama3-8B / Phi3-small).

ICL setting   Acc@1                    1-word ratio
zero-shot     .3385 / .3427 / .4258    .4802 / .1572 / .8883
4-random      .4816 / .5205 / .4889    .8058 / .8147 / .8913
4-relation    .6286 / .6871 / .6339    .9246 / .9071 / .9287
4-template    .6616 / .7268 / .6612    .9266 / .9187 / .9411

5.4 Can BELIEFs mitigate bias?

We explore whether BELIEFs can mitigate prompt bias in evaluations. To measure prompt bias quantitatively, we use content-free prompts, in which the subject is replaced by meaningless tokens (Zhao et al., 2021; Xu et al., 2024), and collect the probabilities of candidate tokens in the output distribution over the mask token. Specifically, we adopt a setting similar to Zhao et al. (2021), ensembling the distributions over prompts with three content-free tokens: "N/A," an empty string, and "?". We measure the bias level of a prompt using the certainty of the distribution over candidate tokens. Specifically, we define the bias level as follows:

$\mathrm{bias\text{-}level} = 1 - \frac{\mathcal{H}}{\mathcal{H}_{max}}$    (3)

where $\mathcal{H}$ is the entropy of the distribution over candidate tokens and $\mathcal{H}_{max}$ is the maximum entropy, i.e., that of the uniform distribution of the same size.

We measure bias in both single- and multi-prompt evaluations. In the single-prompt evaluation, we report the average bias level across all relational templates. For the multi-prompt evaluation, we first average the output distributions of the different templates for each relation and then use the bias level of the averaged distribution. Taking P31 (instance-of, https://www.wikidata.org/wiki/Property:P31) as an example, the average probability of "science" over all templates is 8.30%, but it rises to 52.79% for the template "[Y] contains [X] as one of its elements."
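A minimal sketch of the bias-level computation (Eq. 3); the candidate-token distributions below are toy placeholders.

```python
# Sketch of the bias-level metric (Eq. 3): 1 - H/H_max for the (optionally
# template-averaged) output distribution of content-free prompts.
import numpy as np
from scipy.stats import entropy

def bias_level(dist):
    dist = np.asarray(dist, float)
    dist = dist / dist.sum()
    h_max = np.log(len(dist))              # entropy of the uniform distribution
    return 1.0 - entropy(dist) / h_max

# single-prompt: bias level of one template's content-free distribution
print(bias_level([0.53, 0.30, 0.10, 0.07]))
# multi-prompt: average the distributions over templates, then score once
templates = np.array([[0.53, 0.30, 0.10, 0.07],
                      [0.20, 0.35, 0.25, 0.20]])
print(bias_level(templates.mean(axis=0)))
```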

6 Differentiating PLMs in Fact Probing

This section compares the PLMs’ knowledge recall abilities in terms of accuracy, reliability, and robustness and then explores factors affecting them.

6.1 Factors affecting the recall accuracy

1) Pre-training strategy. Table 3 confirms that BERTwwm outperforms BERTlarge on all metrics, although BERTwwm differs from BERTlarge only in the masking strategy during pre-training. The superiority of BERTwwm likely stems from its more challenging pre-training paradigm, which requires recalling whole words without sub-token information, enhancing word-level contextual understanding. This underscores the importance of the pre-training strategy in knowledge acquisition.

Table 5: BELIEF-ICL results for LLMs of different sizes (4-template ICL setting).

PLMs            Acc@1   Range↓  SD↓     Consist↑  Ovconf
Llama2-7B       .6699   .0257   .0034   .4174     -.0933
Llama2-13B      .7080   .0235   .0031   .4326     -.0662
Llama2-70B      .7784   .0190   .0024   .4449     -.0690
Llama2-7B-IT    .6013   .0368   .0045   .3629     .2007
Llama2-13B-IT   .6482   .0301   .0038   .3656     .1708
Llama2-70B-IT   .7232   .0258   .0031   .4226     .1026
Llama3-8B       .7316   .0194   .0025   .4060     -.1119
Llama3-70B      .8211   .0139   .0017   .4636     -.0812
Phi3-mini       .6106   .0314   .0039   .3686     .0911
Phi3-small      .6668   .0306   .0039   .3667     .1221
Phi3-medium     .7100   .0207   .0025   .4009     .0317

2) Model size. Table 5 compares the knowledge recall abilities of LLMs of different sizes. (Owing to the high computational cost of inference on large LLMs such as Llama2-70B, we select only the five manually rewritten templates with the 4-template ICL setting for this evaluation.) Larger LLMs consistently achieve higher accuracy in predicting facts. Combined with the improvement from BERTbase to BERTlarge in Table 3, this confirms the importance of model size for fact acquisition during pre-training.

Table 6: Zero-shot evaluation results on the entire MyriadLAMA.

PLMs            Acc@1   Range↓  SD↓     Consist↑  Ovconf
Llama2-7B-IT    .2925   .1980   .0253   .1151     .2605
Llama3-8B-IT    .3578   .2213   .0262   .1660     .1402
Phi3-mini       .4258   .2437   .0292   .1782     .2171

3) Pre-training corpora. Table 5 shows that Llama3-8B outperforms the larger Llama2-13B in fact probing. This is likely because Llama3's pre-training corpus is more than seven times larger than Llama2's (Table 2). Meanwhile, Llama3-70B surpasses Llama2-70B, confirming the importance of pre-training data volume for fact acquisition.

In the zero-shot evaluation on the entire MyriadLAMA, shown in Table 6, Phi3-mini outperforms Llama2-7B-IT and Llama3-8B-IT in knowledge retrieval. Given that Phi3-mini (3.8B) is about half the size of Llama2-7B-IT and Llama3-8B-IT, and that larger model size typically enhances knowledge retrieval, this result is notable. This superior performance can be attributed to the high-quality, textbook-like material used for pre-training the Phi3 models, highlighting the significant impact of high-quality training data.

4) Instruction-tuning. Table 7 confirms that the instruction-tuned Llama2-7B-IT exhibits a higher one-word generation rate than Llama2-7B, as expected. However, the instruction-tuned LLM consistently shows lower Acc@1 scores across ICL settings. This indicates a potential negative impact of instruction-tuning: general language understanding may improve, but some factual knowledge is partially lost as a result of the tuning process.

5) Inclusion and selection of demonstrations. As shown in Table 7 and Table 18, using demonstrations in prompts consistently improves Acc@1. Including few-shot demonstrations with the same template as the target question can nearly double Acc@1 (from the zero-shot to the 4-template setting). Closer demonstrations also enhance performance across all metrics, highlighting a significant gap between the factual knowledge LLMs have memorized and what they can actually recall.

Table 7: Llama2-7B vs. Llama2-7B-IT under different ICL settings.

PLMs          ICL setting  Acc@1   Range↓  SD↓    Consist↑  Ovconf   1-word ratio
Llama2-7B     zero-shot    .3385   .2602   .0299  .1269     -.1119   .4752
Llama2-7B     4-random     .4816   .2250   .0270  .2312     -.0894   .8247
Llama2-7B     4-relation   .6286   .1221   .0150  .3753     -.1335   .9060
Llama2-7B     4-template   .6616   .0294   .0036  .4163     -.0933   .9299
Llama2-7B-IT  zero-shot    .2925   .1980   .0253  .1151     .2605    .9069
Llama2-7B-IT  4-random     .4334   .1958   .0229  .2128     .2410    .9081
Llama2-7B-IT  4-relation   .5576   .0791   .0092  .3341     .1900    .9314
Llama2-7B-IT  4-template   .5896   .0439   .0050  .3687     .2061    .9380

6.2 Factors affecting the reliability

Table 3 shows a significant difference in Ovconf between the BERT models and Llama3-8B, with the BERT models being overconfident and Llama3-8B being underconfident. In this section, we explore the reasons for these differences and investigate additional factors affecting reliability beyond model size.

1) The number of output tokens. One main difference in the Ovconf calculation between encoder- and decoder-based PLMs is that decoder-based PLMs generate multiple tokens. Thus, we investigate the effect of the output token count on Ovconf values. We divide the MyriadLAMA prompt set into groups based on the number of generated tokens. For each group, we calculate the probability of the entire token sequence and compute Ovconf for token counts from 1 to 5 (generations within five tokens cover 98.78% of Llama3-8B's outputs under the 4-template ICL setting).

The Ovconf values for each group of 1 to 5 output tokens on Llama3-8B (4-template) are -0.1030, -0.0906, -0.0297, -0.0546, and 0.0573, showing models become more overconfident with more output tokens.This trend is consistent across models.

[Figure 3]

2) Instruction-tuning inflates LLMs' confidence. Table 7 and Figure 3 confirm that instruction-tuning makes models overly confident in their outputs. Pre-training uses more diverse language data with inherent uncertainty, which can lead to more calibrated output confidence. Instruction-tuning narrows the LLMs' exposure to specific tasks, reducing their ability to express uncertainty and making them more likely to produce overconfident outputs.

3) Model size: Larger models consistently demonstrate improved reliability, as illustrated in Table 5.

6.3 Factors affecting the robustness

1) Larger models do not make zero-shot knowledge probing more robust. As with accuracy and reliability, few-shot knowledge probing shows improved robustness, i.e., smaller accuracy fluctuations and higher consistency, as model size increases. However, this effect is absent in the zero-shot setting. For instance, the SDs for the Llama2 family are 0.2014, 0.2131, and 0.2126 for the 7B, 13B, and 70B models, respectively. Similar inconsistencies are observed across other LLM families with varying model sizes. Refer to Table 18 for more details.

[Figure 4: Knowledge sharing rates between PLMs.]

2) Instruction-tuning makes fluctuations less influenced by context. Table 7 and Table 17 show that the instruction-tuned models exhibit reduced fluctuation (smaller range and SD) in the zero-shot, 4-random, and 4-relation ICL settings, but perform worse in the 4-template setting. This suggests that instruction-tuned models become less influenced by context and more reliant on the instruction itself.

In contrast, the Consist measure consistently decreased in instruction-tuned models, suggesting that while instruction-tuning improves instruction interpretation, it may weaken semantic understanding, especially with paraphrases.

6.4 How do PLMs perceive facts differently?

Finally, we measure the differences in fact coverage among models. We first collect the correctly predicted facts for each template, defining these as the model's covered facts. Given the covered facts of two models, we measure knowledge sharing rates using an asymmetric metric that calculates the proportion of shared facts relative to each model's total covered facts.
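A minimal sketch of this asymmetric metric, representing each model's covered facts as a set of hypothetical fact identifiers:

```python
# Sketch of the asymmetric knowledge-sharing rate: the fraction of model A's
# covered facts that model B also covers (and vice versa).
def sharing_rate(covered_a, covered_b):
    """Proportion of facts covered by A that are also covered by B."""
    return len(covered_a & covered_b) / len(covered_a)

facts_a = {("John Lennon", "plays-instrument", "guitar"), ("Tokyo", "capital-of", "Japan")}
facts_b = {("Tokyo", "capital-of", "Japan")}
print(sharing_rate(facts_a, facts_b), sharing_rate(facts_b, facts_a))   # 0.5 1.0
```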

Figure 4 shows the results. The average sharing rate among the BERT models is 69.1%, and 68.7% for Llama2-7B, Llama2-7B-IT, and Phi3-mini in the zero-shot setting. In comparison, the average sharing rate between encoder- and decoder-based PLMs drops to 47.1%. Meanwhile, the knowledge sharing rates in both the zero-shot and 4-template settings indicate that incorporating examples increases the knowledge elicited from the PLMs. However, about 10% of the knowledge can still only be elicited in the zero-shot setting (see the red boxes).

Table 8: Knowledge coverage rates (average, maximum, and oracle Acc@1).

PLMs (setting)           Average  Maximum  Oracle
BERTwwm                  .1364    .4501    .6636
Llama2-7B (zero-shot)    .3385    .6577    .8153
Llama3-8B (zero-shot)    .3427    .7099    .8756
Phi3-small (zero-shot)   .4258    .6828    .8642
Llama2-7B (4-template)   .6616    .7197    .8133
Llama3-8B (4-template)   .7268    .7731    .8628
Phi3-small (4-template)  .6612    .7181    .8346

7 Limitation of Prompt-based Probing

Finally, we examine the limitations of prompt-based knowledge probing using our massively diverse dataset. First, we gauge the average knowledge coverage rate using the average Acc@1 (average). Next, for each relation, we calculate the maximum Acc@1 using the template that yields the highest accuracy (selecting the best subject expression among the prompts for each fact), and use this value to estimate the upper limit of prompt-based knowledge probing (maximum). Finally, we approximate the upper limit of facts contained in LLMs by considering a fact as present if at least one of its prompts produces the correct answer (oracle).
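The three coverage rates could be computed as sketched below from a hypothetical boolean table of per-fact, per-template correctness; for brevity the sketch picks one globally best template, whereas the paper selects the best template per relation and the best subject expression per fact.

```python
# Sketch of the average / maximum / oracle knowledge coverage rates in Sec. 7.
# `correct[fact][template]` is a hypothetical boolean correctness table.
import numpy as np

def coverage_rates(correct):
    m = np.asarray(correct, bool)          # shape: (num_facts, num_templates)
    average = m.mean()                     # average Acc@1 over all templates
    maximum = m.mean(axis=0).max()         # best single template (per relation in the paper)
    oracle = m.any(axis=1).mean()          # a fact counts if any template recalls it
    return average, maximum, oracle

toy = [[1, 0, 0],
       [0, 1, 0],
       [1, 1, 0]]
print(coverage_rates(toy))                 # (0.44..., 0.67..., 1.0)
```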

Table 8 shows the three knowledge coverage rates for several PLMs. For PLMs under zero-shot settings (including BERTwwm), we observe nearly a 30-point increase from average to maximum accuracy, emphasizing the importance of selecting suitable templates for specific facts and the potential gains from prompt engineering. This gap shrinks to about 5 points under few-shot settings. However, the gap between maximum and oracle accuracy largely remains, indicating that different facts prefer different templates and that no single template works for all facts. Combining templates reveals the true upper limit of PLMs' knowledge memorization and highlights the importance of using diverse prompts over optimizing a single one for retrieval. Refer to §F.4 for results on more PLMs.

8 Related Work

The LAMA probe was first proposed to evaluate the utility of PLMs as knowledge bases via a fill-in-the-blank task (Petroni et al., 2019). Several researchers have extended the LAMA probe to evaluate PLMs' ability to understand facts from diverse linguistic aspects, such as the effect of negation/mispriming (Kassner and Schütze, 2020), distractors (Pandia and Ettinger, 2021), multilingual understanding (Keleg and Magdy, 2023; Zhao et al., 2024), and models' consistency when facing prompts with minor nuances (Fierro and Søgaard, 2022; Elazar et al., 2021). However, these studies lack an inspection of PLMs' reliability in knowledge prediction, which is vital when deploying LLMs in real-world tasks. Moreover, solving the fill-in-the-blank task with LLMs trained with the causal LM objective can underestimate their knowledge recall ability.

Recently, QA-based datasets have been developed to evaluate the knowledge recall ability of decoder-only LMs. Kalo and Fichtel (2022) created a high-quality QA prompt set, which Wiland et al. (2024) further extended to evaluate both causal and masked LMs. Mallen et al. (2023) and Maekawa et al. (2024) developed QA datasets to examine the impact of knowledge popularity and retrieval augmentation. Since the writing style of these datasets is limited to questions, they do not support reliable robustness evaluation.

9 Conclusions

This paper presents the multifaceted factual probing benchmarks BELIEF and BELIEF-ICL for encoder- and decoder-based PLMs, respectively. Leveraging a multi-prompt dataset, BELIEFs provide various evaluation metrics, including accuracy, consistency, and reliability, enabling a thorough evaluation of PLMs' knowledge recall abilities. To make BELIEFs more reliable, we build a new multi-prompt dataset for knowledge probing, MyriadLAMA, featuring diverse prompts for each fact. We conducted extensive experiments on multiple encoder-based PLMs and recent LLMs.

Based on the evaluation results, we identify key factors affecting the accuracy, reliability, and robustness of PLMs' fact recall, such as model size, pre-training strategy and corpora, and ICL settings. We also reveal the negative effect of instruction-tuning on recalling factual knowledge from LLMs. This highlights the need for careful design of instruction-tuning to preserve LLMs' knowledge recall abilities. Finally, by probing facts in different ways, we find that PLMs hold more knowledge than is revealed by even the optimal template, highlighting the limitations of prompt-based factual probing.

10 Limitations

MyriadLAMA contains an extensive amount of prompts, which leads to high evaluation costs.In the future, we aim to extract a diverse yet robust subset from MyriadLAMA to enable a more efficient evaluation of factual knowledge.Additionally, MyriadLAMA is built upon LAMA-UHN, which includes only 41 relationships.Expanding the range of relations is essential to improve coverage in the evaluation of factual knowledge.Lastly, we need to evaluate closed-source LLMs, such as GPT-4 and Claude, to examine performance differences between them and open-source LLMs.

Acknowledgements

This work was partially supported by the special fund of Institute of Industrial Science, The University of Tokyo, by JSPS KAKENHI Grant Number JP21H03494, and by JST, CREST Grant Number JPMJCR19A, Japan.

References

Appendix A Construction of MyriadLAMA

In this appendix, we explain the detailed procedure for generating the derived triples from unique triples in MyriadLAMA. As discussed in §4, we first extend the unique triples contained in LAMA-UHN (Petroni et al., 2020) by searching for new objects in T-REx (Elsahar et al., 2018). Next, for the obtained unique triples, we generate derived triples by combining the concrete linguistic expressions associated with entities (subjects and objects) with diversified relational templates created using both manual labor and LLMs. We describe the detailed procedure as follows.

A.1 The extension of entities

Extension of unique triples from T-REx

LAMA-UHN is a refined subset of the LAMA dataset, which originates from T-REx (Elsahar et al., 2018). T-REx is a large-scale knowledge base containing 11 million real-world knowledge triples aligned with 3.09 million Wikipedia abstracts, designed to create large-scale alignments between Wikipedia abstracts and Wikidata triples. To achieve this alignment, T-REx employed three distinct aligners (NoSub, AllEnt, and SPO), each offering a different level of accuracy (0.98, 0.96, and 0.88, respectively) as measured on a test set. Despite the high alignment accuracy of all three aligners, LAMA-UHN selects only the triples aligned by NoSub, the aligner with the highest accuracy. While this choice ensures the high correctness of triples within LAMA, it potentially compromises the ability to fairly assess a PLM's knowledge recall, as it may overlook valid answers during evaluation. To address this limitation, we expand MyriadLAMA by incorporating triples aligned by all three aligners (NoSub, AllEnt, and SPO) found in T-REx, based on the subject-relation pairs present in LAMA-UHN. As a result, we increase the number of unique triples from 27,106 to 34,048, as shown in Table 1.

Extension of entities using aliases

Next, we utilize aliases of entities obtained from Wikidata to acquire diverse linguistic expressions (and their paraphrases) for the subjects and objects. Specifically, we use the Wikidata identifiers of entities (https://www.wikidata.org/wiki/Wikidata:Identifiers) and the Wikidata API (https://www.wikidata.org/wiki/Special:EntityData/<entity_identifier>.json) to retrieve the English alias expressions of entities. By combining the aliases of subjects and objects with the relational templates described later, we generate numerous new derived triples. If $N$ subject expressions and $M$ object expressions are given for a unique triple, a single relational template yields $N \times M$ derived triples for that unique triple.
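The alias retrieval could be sketched as below; the JSON structure assumed here is the one commonly returned by the Special:EntityData endpoint, so the field names should be treated as assumptions. Combining the retrieved subject and object expressions with a relational template then yields the $N \times M$ derived triples described above.

```python
# Sketch of retrieving English labels/aliases for an entity via the Wikidata
# API mentioned above. The JSON field layout is an assumption.
import requests

def entity_aliases(entity_id, lang="en"):
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
    data = requests.get(url, timeout=30).json()["entities"][entity_id]
    label = data.get("labels", {}).get(lang, {}).get("value")
    aliases = [a["value"] for a in data.get("aliases", {}).get(lang, [])]
    return ([label] if label else []) + aliases

print(entity_aliases("Q42"))   # illustrative entity id (Douglas Adams)
```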

A.2 Diversification of relation templates

We use a two-step procedure to create new relational templates, ensuring both quality and quantity. First, we manually rewrite relational templates so that every relation has five templates. Then, we employ a generative LLM (GPT-4) to automatically paraphrase each template 19 times. In total, we produce 100 templates for each relation.

Step 1: Manually rewriting relational templates.

The manual rewriting of the relational templates is performed by the first author of this paper. We create new templates by describing the relationship between subject and object from different perspectives rather than creating templates with exactly the same meaning as the original template. Using the resources provided by Wikidata (https://www.wikidata.org/wiki/Property:<relation_identifier>), we not only paraphrase existing templates to generate new ones with diverse lexicons but also devise entailment expressions to cover various semantic expressions that convey the same relations. These newly created templates are guaranteed to uphold relational equivalence, following the relationship between the subject and object. Taking P20 ("[X] died in [Y].", https://www.wikidata.org/wiki/Property:P20) as an example, we create new templates by either changing the sentence pattern or adding type information about the object (e.g., "[X] resided in [Y] until death."). Furthermore, we also create templates that avoid the keywords of the relation (dead/death) and instead express it through entailment (e.g., "[X] spent the last years of life in [Y]."). Moreover, we devise a question-answer-style template for each relation to enhance syntactic diversity; the question incorporates the subject and relation information, while the answer corresponds to the object.

Note that, during paraphrasing, we observed that some templates in LAMA-UHN only partially express the original meaning of the relations defined in Wikidata, making them inappropriate for specific knowledge triples. For example, P136 describes a creative work's genre or an artist's field of work (https://www.wikidata.org/wiki/Property:P136), where the type of work includes music, film, literature, etc. However, the original template of P136 in LAMA-UHN is "[X] plays [Y] music.", which cannot correctly retrieve information about works other than music. For such templates, we abandon the original template and newly create five templates.

Step 2: Paraphrasing templates using GPT-4

Based on the original relational templates and those rewritten manually, we further paraphrase these templates automatically using the GPT-4 API (gpt-4-1106-preview, https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) provided by OpenAI. The instruction used for GPT-4 paraphrasing is:

You are a professional tool that can paraphrase sentences into natural sentences that can correctly represent the relationship between [X] and [Y], without repetition. Make the paraphrase as diverse as possible using simple words. Please paraphrase the given sentence 19 times.

When duplicated sentences are generated, we remove the duplicates and regenerate new templates with the same instruction until 19 different templates are obtained. Furthermore, we observe that GPT-4 occasionally generates relational templates that are semantically inappropriate for specific relations due to incorrect category information of entities. In such instances, we refine the instructions to include the category information of the entities, ensuring an accurate representation of the relationship between the subjects and the objects. For example, when paraphrasing the relational template "[X] used to work in [Y]." (https://www.wikidata.org/wiki/Property:P937), we additionally add explicit guidance regarding the expected format and semantics of the relational templates to the above instruction, as follows:

Be aware that [Y] is the geographic location but NOT company or organization, where persons or organizations were actively participating in employment, business or other work.

As a result, we can obtain the following paraphrased relational templates for “[X] used to work in [Y].”:

  • “[X] was formerly employed in [Y].”

  • “[X] once worked at [Y].”

  • “[Y] was the place where [X] used to be engaged in work.”
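The paraphrase-and-deduplicate loop described in Step 2 could be sketched as follows with the OpenAI chat completions API (requires OPENAI_API_KEY); the response parsing and retry policy are simplified assumptions, not the exact pipeline used for MyriadLAMA.

```python
# Sketch of the GPT-4 paraphrasing loop: request 19 paraphrases, drop
# duplicates, and re-query until enough distinct templates are collected.
# Parsing and retry policy are simplified assumptions.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
INSTRUCTION = (
    "You are a professional tool that can paraphrase sentences into natural sentences "
    "that can correctly represent the relationship between [X] and [Y], without repetition. "
    "Make the paraphrase as diverse as possible using simple words. "
    "Please paraphrase the given sentence 19 times."
)

def paraphrase_template(template, n_target=19, model="gpt-4-1106-preview", max_rounds=5):
    collected = set()
    for _ in range(max_rounds):
        if len(collected) >= n_target:
            break
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": INSTRUCTION},
                      {"role": "user", "content": template}],
        )
        for line in resp.choices[0].message.content.splitlines():
            line = line.strip().lstrip("0123456789. -").strip()
            if line and line != template:
                collected.add(line)
    return sorted(collected)[:n_target]
```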

A.3 Example of extended relational templates in MyriadLAMA

We display part of the templates created in MyriadLAMA. For each relation, we randomly select two manually rewritten templates and three auto-generated paraphrases of those templates. The sampled templates for all relations are shown in Tables 20, 21, 22, 23, 24, and 25.

Appendix B The Advantage of MyriadLAMA

Given that our study seeks to mitigate the influence of individual prompt bias in evaluations, the availability of a wide range of prompts, in both quantity and diversity, is crucial. Diversity ensures that different prompts can capture different aspects of the true knowledge distribution. On the other hand, the quality (correctness) of prompts ensures that the evaluation accurately reflects the true knowledge recall ability.

In this section, we provide a quantitative analysis of the quality and diversity of multi-prompt factual knowledge probing datasets. The comparison demonstrates the superiority of MyriadLAMA over previous datasets, enabling more accurate and comprehensive evaluations. We compare MyriadLAMA with other multi-prompt probing datasets, LPAQA (Jiang et al., 2020) and PARAREL (Elazar et al., 2021), in terms of quantity and diversity.

B.1 Diversity comparison

We measure the diversity of multi-prompt factual knowledge probing datasets in terms of both quantity and linguistic diversity.

Specifically, we calculate the average number of prompts per subject-relation pair as the quantity measure. MyriadLAMA introduces diversity into prompts by using various subject expressions and relational templates; on average, it provides 2.47 expressions for each subject. In addition, we measure the linguistic diversity of relational templates from the three aspects below:

Lexicon:

We utilize the Jaccard distance of words in templates to gauge lexicon diversity.

Syntax:

We adopt the syntax distance measure proposed in Oya (2020), which calculates the distance between dependency trees.

Semantics:

We quantify semantic diversity by calculating the L2 distance of sentence embeddings given by BERTlarge.
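As one example of these measures, the lexical (Jaccard) distance could be computed as in the sketch below; tokenization is simplified to whitespace splitting.

```python
# Sketch of the lexical-diversity measure: average pairwise Jaccard distance
# between the word sets of a relation's templates (tokenization simplified).
from itertools import combinations

def jaccard_distance(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)

def lexical_diversity(templates):
    pairs = list(combinations(templates, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

print(lexical_diversity(["[X] died in [Y].",
                         "[X] resided in [Y] until death.",
                         "[X] spent the last years of life in [Y]."]))
```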

The results are shown in Table 9. MyriadLAMA demonstrates superior quantity and diversity compared with existing multi-prompt factual probing datasets. Although LPAQA exhibits greater semantic diversity, this is mainly due to its use of distant supervision to discover new templates, which often yields problematic templates that inadequately describe the relationship between subjects and objects. For example, for relation P937 ("[X] used to work in [Y].", https://www.wikidata.org/wiki/Property:P937), LPAQA includes templates like "[X] to meet [Y]," which significantly deviate from the original meaning. We analyze and compare template quality in the next section.

Table 9: Quantity and diversity comparison of multi-prompt probing datasets.

Dataset      Quantity↑  Lexicon↑  Syntax↑  Semantic↑
PARAREL      7.30       .4860     .1489    11.03
LPAQA        53.27      .5449     .1713    13.55
MyriadLAMA   263.47     .6652     .2138    12.69

B.2 Quality comparison

In this section, we evaluate the quality of the relational templates created in MyriadLAMA in terms of correctly expressing the relation between subject and object. We manually evaluate the templates created in each dataset using a strict quality evaluation framework. Specifically, we evaluate each template based on its fluency and its ability to correctly express the semantic relationship between subjects and objects. Given the complex and specific constraints defined by Wikidata relations, creating perfect templates that satisfy all subjects and objects for a given relation is challenging.

B.2.1 Semantic relationship between template and relation

[Template \subseteq Relation]: If the subject and object fit the template, the template is correct for the relation, but the relation's knowledge range is broader than the template can cover. We denote such templates as [Template \subseteq Relation]. Using the templates in LAMA-UHN, which are often considered golden templates, as examples: relation P136 (https://www.wikidata.org/wiki/Property:P1303) uses the template "[X] plays [Y] music." to describe creative work genres or an artist's field. However, P136 encompasses film, literature, and other arts, not just music.

[Relation ⊆ Template]: In contrast, if a subject-object pair is true for the relation, it is also true for the template, meaning the template's knowledge range is broader than the relation's. For example, LAMA-UHN creates the template "[X] died in [Y]." for P20 (https://www.wikidata.org/wiki/Property:P20). While this template can be used to infer a person's place of death, "[Y]" could also be the year in which "[X]" passed away.

[Relation ∩ Template > 0]: Additionally, some templates fit neither [Relation ⊆ Template] nor [Template ⊆ Relation] but can still correctly describe the relationship for some subject-object pairs. For example, PARAREL, which paraphrases templates manually, uses "[X] is a follower of [Y]." for relation P140 (https://www.wikidata.org/wiki/Property:P140; religion or worldview). This template is appropriate for individuals but not for organizations, so it does not fully capture the relation's scope and thus does not satisfy [Relation ⊆ Template]. Moreover, when "[X]" is a person, "[Y]" can be a religious figure or leader rather than the religion itself, which violates [Template ⊆ Relation]. Nevertheless, many subject-object pairs are still correctly captured by this template; we use [Relation ∩ Template > 0] to denote such templates.

[Irrelevant]: We consider templates that do not correctly convey the relationship between subject and object to be [Irrelevant]. For example, LPAQA mined many templates from corpora without careful checking, resulting in low-quality templates such as "[X] arrived in [Y]" for P937.

[Fully Matching]: Finally, we use [Fully Matching] to denote templates that accurately capture all the subject-object pairs that fit the relation.

We illustrate the five types of semantic relationships between templates and relations in Figure 5 using a Venn diagram.

[Figure 5: Venn diagram of the five types of semantic relationships between templates and relations.]

B.2.2 Template quality evaluation metrics

To capture both the fluency of the created templates and their ability to correctly express the relationship between subjects and objects, we score template quality with the following metrics. Each item is scored as either 1 or 0 depending on whether the template meets the requirement.

1) Fluency: Is the template a natural sentence or noun phrase? (We also accept noun phrases, since LPAQA created many noun-phrase templates, such as "[X], who works as [Y]." for relation P106, https://www.wikidata.org/wiki/Property:P106.) Set 1 if the template is natural; otherwise 0.

2) [Relation ⊆ Template]: If the template satisfies the definition of [Relation ⊆ Template], then 1; otherwise 0.

3) [Template ⊆ Relation]: If the template satisfies the definition of [Template ⊆ Relation], then 1; otherwise 0.

4) [Relation ∩ Template > 0]: If the template satisfies the definition of [Relation ∩ Template > 0], then 1; otherwise 0.

If either [Relation ⊆ Template] or [Template ⊆ Relation] is 1, then [Relation ∩ Template > 0] must also be 1. If [Relation ⊆ Template], [Template ⊆ Relation], and [Relation ∩ Template > 0] are all 0, the template is classified as [Irrelevant]. If all three metrics are 1, the template is classified as [Fully Matching].

Note that, since what PLMs actually see is the prompt with the subject filled in, we consider the subject when scoring a template. For example, P413 (https://www.wikidata.org/wiki/Property:P413) describes the position or specialism of a player on a team. The template "[X] plays in the position of [Y]." may appear too general, as it could also describe a player's position in an orchestra; however, specifying "[X]" in the prompt removes this ambiguity, making it an accurate [Fully Matching] template for the relation.
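The mapping from the four binary scores to the five categories and the total score can be sketched as follows (the class and field names are ours, for illustration only):

```python
# Minimal sketch: derive the template category and total score from the four
# binary quality metrics described above. Names are illustrative only.
from dataclasses import dataclass

@dataclass
class TemplateScore:
    fluency: int            # 1 if the template is a natural sentence/noun phrase
    rel_in_tpl: int         # [Relation ⊆ Template]
    tpl_in_rel: int         # [Template ⊆ Relation]
    overlap: int            # [Relation ∩ Template > 0]

    def category(self) -> str:
        # Either containment direction implies a non-empty overlap.
        assert not (self.rel_in_tpl or self.tpl_in_rel) or self.overlap == 1
        if self.rel_in_tpl and self.tpl_in_rel and self.overlap:
            return "Fully Matching"
        if not (self.rel_in_tpl or self.tpl_in_rel or self.overlap):
            return "Irrelevant"
        if self.tpl_in_rel:
            return "Template ⊆ Relation"
        if self.rel_in_tpl:
            return "Relation ⊆ Template"
        return "Relation ∩ Template > 0"

    def total(self) -> int:
        # Per-template score summed over the four items, as in Table 10.
        return self.fluency + self.rel_in_tpl + self.tpl_in_rel + self.overlap

print(TemplateScore(1, 0, 1, 1).category())  # -> "Template ⊆ Relation"
```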

Table 10: Average template quality scores of the four datasets.
Dataset     | Fluency | Template ⊆ Relation | Relation ⊆ Template | Relation ∩ Template > 0 | Total Average
LAMA-UHN    | 1       | .732                | .976                | 1                       | 3.707
PARAREL     | 0.99    | .790                | .905                | .985                    | 3.670
LPAQA       | 0.57    | .220                | .345                | .405                    | 1.540
MyriadLAMA  | 1       | .770                | .830                | .985                    | 3.585

B.2.3 Evaluation result and analysis

The comparison includes four datasets: LAMA-UHN, PARAREL, LPAQA, and MyriadLAMA. Given the large number of templates in the three multi-prompt datasets (6,654 templates in total), we randomly sample 200 templates for the multi-prompt probing datasets and use all 41 templates in LAMA-UHN for evaluation. To ensure objectivity, we anonymize the source of each template and mix them together for the annotator (the first author). We publicize the annotation results at https://anonymous.4open.science/r/belief-CC8A. The evaluation results are shown in Table 10.

From Table 10, we observe that our semi-automatically generated relational templates achieve quality comparable to manually created datasets such as LAMA-UHN and PARAREL, while being 100 times larger than LAMA-UHN and 13.7 times larger than LPAQA. MyriadLAMA significantly outperforms LPAQA in template quality thanks to our two-stage template creation method.

Furthermore, Figure 6 shows the score distributions of the 200 sampled templates for the three multi-prompt datasets. It reveals that LPAQA contains many low-scoring templates, with 0 being the most common score. Compared to PARAREL, MyriadLAMA has slightly more templates with a score of 3 but slightly fewer with a score of 4, resulting in slightly lower overall quality.

[Figure 6: Score distributions of the sampled templates for the three multi-prompt datasets.]

Appendix C Ablation Analysis of MyriadLAMA

In this section, we conduct an ablation analysis of MyriadLAMA to understand the validity of diversifying entities and templates.

C.1 Validity of extended entity expressions

We evaluate the validity of the extended entity expressions in MyriadLAMA by checking if these extensions cover facts that PLMs can capture but are missed in LAMA-UHN due to strict entity expression limitations. We conduct this analysis on BERT models, focusing on facts with extended subject and object expressions. MyriadLAMA contains 13,123 facts with extended subjects and 23,195 facts with extended objects. We measure the rate at which extended subjects/objects achieve higher ranks than the original expressions in the token distribution output.

The results, shown in Table 11 below, indicate that around 50% of extended subjects and 20% of extended objects achieve higher ranks than the original entities. This suggests that many facts are missed in LAMA-UHN and other single-expression factual knowledge probing datasets.

Table 11: Rate at which extended subject/object expressions achieve higher ranks than the original expressions.
PLMs       | Subject | Object
BERTbase   | .5355   | .2107
BERTlarge  | .5358   | .2116
BERTwwm    | .5272   | .1853
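For concreteness, the sketch below shows one way to compare the rank of an original object expression with that of an extended alias in a masked LM's output distribution; the prompt, the alias, and the single-token assumption are illustrative rather than the released evaluation code.

```python
# Minimal sketch (illustrative): compare the rank of an original object vs. an
# extended alias in BERT's [MASK] distribution. Assumes both expressions are
# single tokens in the vocabulary, as in LAMA-style probing.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

def object_rank(prompt: str, obj: str) -> int:
    """Rank (1 = best) of `obj` in the predicted distribution at the [MASK] position."""
    enc = tok(prompt, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]        # (vocab_size,)
    obj_id = tok.convert_tokens_to_ids(tok.tokenize(obj))[0]
    return int((logits > logits[obj_id]).sum().item()) + 1

prompt = f"The Colosseum is located in {tok.mask_token}."
print(object_rank(prompt, "rome"), object_rank(prompt, "roma"))  # original vs. alias
```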

C.2 Validity of paraphrased templates

In this section, we evaluate the validity of the relation templates in MyriadLAMA. We investigate the accuracy of each template and compare the accuracies of the templates in LAMA-UHN, the manually rewritten templates, and the auto-generated templates. Specifically, for each relation, we evaluate the accuracy (Acc@1) of every relation template separately and then identify the minimum and maximum accuracies among all templates for that relation. We then measure the dataset-level minimum/maximum accuracy by micro-averaging over the template sets with the minimum/maximum template accuracies (41 templates in each set). Finally, all template-specific accuracies are micro-averaged to compute the average Acc@1. As indicated in Table 12, while the quality of MyriadLAMA's prompts varies significantly, its high-quality prompts are notably superior to those of LAMA-UHN. Although the average accuracy of MyriadLAMA is lower than that of LAMA-UHN, this is likely because MyriadLAMA uses semi-automatically created relation templates, whereas LAMA-UHN uses carefully selected entities and templates.

Table 12: Acc@1 of LAMA-UHN templates and the minimum/maximum/mean Acc@1 over MyriadLAMA templates.
PLMs       | LAMA-UHN | MyriadLAMA Min | MyriadLAMA Max | MyriadLAMA Mean
BERTbase   | .2403    | .0000          | .3534          | .1103
BERTlarge  | .2454    | .0007          | .3728          | .1185
BERTwwm    | .2448    | .0015          | .3695          | .1453
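A minimal sketch of the aggregation described above is given below; the nested-dictionary layout of per-template accuracy counts is an assumption made for illustration.

```python
# Minimal sketch of the dataset-level min/max/mean Acc@1 aggregation.
# acc[relation][template] = (num_correct, num_facts) is an assumed layout.
def aggregate(acc: dict) -> dict:
    min_pairs, max_pairs, all_correct, all_total = [], [], 0, 0
    for per_template in acc.values():
        n_facts = next(iter(per_template.values()))[1]        # facts per relation
        accs = [c / n for c, n in per_template.values()]
        # Dataset-level min/max: micro-average over the per-relation extreme templates.
        min_pairs.append((min(accs) * n_facts, n_facts))
        max_pairs.append((max(accs) * n_facts, n_facts))
        # Mean Acc@1: micro-average over every (template, fact) prediction.
        all_correct += sum(c for c, _ in per_template.values())
        all_total += sum(n for _, n in per_template.values())
    micro = lambda pairs: sum(c for c, _ in pairs) / sum(n for _, n in pairs)
    return {"min": micro(min_pairs), "max": micro(max_pairs),
            "mean": all_correct / all_total}

toy = {"P19": {"t1": (30, 100), "t2": (10, 100)},
       "P20": {"t1": (50, 200), "t2": (40, 200)}}
print(aggregate(toy))  # {'min': 0.1667, 'max': 0.2667, 'mean': 0.2167} (approx.)
```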
Table 13: Consistency and Acc@1 range when varying only subject expressions or only relational templates.
PLMs       | Consist↑ (Subject) | Consist↑ (Relation) | Acc@1 min/max (Subject) | Acc@1 min/max (Relation)
BERTbase   | .5745              | .1504               | .0673/.1441             | .0000/.3534
BERTlarge  | .5497              | .1548               | .0714/.1554             | .0007/.3728
BERTwwm    | .5005              | .1057               | .0831/.1884             | .0015/.3695

C.3 What matters to robustness? Diverse subject vs. templates

Next, we investigate the factors contributing to the varying performance and inconsistent predictions across prompts. MyriadLAMA creates diverse prompts for each fact by combining different subject expressions and templates. To gauge their individual impact on robustness, we examine both the consistency (Consist) and the accuracy range (min/max) across different expressions of subjects or relations, assessed separately. To achieve this, the complete set of prompts is partitioned into multiple subsets, each containing only one expression for each unique subject or relation. The Acc@1 of the prompts obtained in this manner is then evaluated using different BERT variants.

The results in Table 13 indicate that, while the variation in accuracy range (min/max) and consistency (Consist) caused by subject aliases is less pronounced than that caused by diverse expressions of relational templates, its effect on factual knowledge evaluation remains significant. These findings highlight the vulnerability of factual knowledge evaluation based on single prompts and underscore the importance of harnessing the diversity of prompts within MyriadLAMA for robust assessments.

C.4 Manually rewritten vs. auto-generated templates

Upon comparing relational templates generated through manual rewriting and GPT-4 auto-generation, we find that auto-generated templates exhibit quality (accuracy) comparable to manually rewritten templates; however, they elicit less diverse predictions, in line with our expectations.

To assess the validity of LLM-generated templates for knowledge probing, we rank the accuracies (Acc@1) of manually created templates against those generated by LLMs. Specifically, for each relation, we rank each of the 5 manual templates among all 100 templates and average the ranks across all manually created templates and all relations. Table 14 shows the average Acc@1 ranks of manual templates among the 100 templates on BERTbase, BERTlarge, and BERTwwm: 47.40, 45.64, and 44.80, respectively. These values closely approximate the average rank of 50, indicating that auto-generated templates achieve nearly the same accuracy as manually created templates.

Furthermore, we quantify the diversity gap between manually written and auto-generated templates. We group each manually written template together with its auto-generated paraphrases, resulting in five groups of 20 templates for each relation. We then evaluate the similarity between templates within the same group and across different groups using the consistency measure (Consist), as presented in Table 14. The consistency among prompts within the same group (inner-group) is notably high, whereas prompts from different groups (inter-group) show much lower consistency, i.e., more diverse predictions. This underscores the significance of manual phrase rewriting, which yields more diverse prompts and facilitates a more comprehensive evaluation.
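The inner- vs. inter-group comparison can be sketched as follows, assuming Consist is computed as the fraction of prompt pairs whose top-1 predictions agree (the exact definition in §2.3 may differ in detail):

```python
# Minimal sketch: inner- vs. inter-group consistency of top-1 predictions
# for a single fact; predictions are assumed to be precomputed strings.
from itertools import combinations
from typing import Optional

def pair_consistency(a: list, b: Optional[list] = None) -> float:
    """Fraction of prediction pairs that agree (within one group if b is None)."""
    pairs = list(combinations(a, 2)) if b is None else [(x, y) for x in a for y in b]
    return sum(x == y for x, y in pairs) / len(pairs)

def inner_inter(groups: list) -> tuple:
    inner = [pair_consistency(g) for g in groups]
    inter = [pair_consistency(g1, g2) for g1, g2 in combinations(groups, 2)]
    return sum(inner) / len(inner), sum(inter) / len(inter)

# Toy example: each group holds predictions from one manual template
# and its auto-generated paraphrases.
groups = [["Paris", "Paris", "Paris"], ["Paris", "Lyon", "Lyon"]]
print(inner_inter(groups))  # inner-group (≈.67) exceeds inter-group (≈.33)
```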

Table 14: Average Acc@1 rank of manual templates among 100 templates, and inner-/inter-group consistency.
PLMs       | Average rank of manual prompts (Acc@1) | Consist (Inner-group) | Consist (Inter-group)
BERTbase   | 47.40                                  | .2904                 | .1065
BERTlarge  | 45.64                                  | .2884                 | .1125
BERTwwm    | 44.80                                  | .2387                 | .0630

Appendix D QA-Style ICL and Its Evaluation

D.1 QA-style instruction

Besides the mask-prediction-style (MP-style) ICL task, we also define and evaluate a question-answer-style (QA-style) ICL task using the QA-style relational templates available in MyriadLAMA. This is possible because MyriadLAMA provides 20 QA-style templates for each relation, offering not only syntactic diversity but also suitability for the autoregressive generation process of LLMs. Each QA-style prompt follows a format in which the subject and relation form the question and the object corresponds to the answer, such as "Who developed [X]? [Y]." For the QA prompts, we employ few-shot prompts comprising X random QA pairs, following the format outlined in InstructGPT (Ouyang et al., 2022). Given that all objects in MyriadLAMA are intended to be matched with single words, we append the instruction "Answer each question in one word." to ensure compatibility.

Given the limited number of QA-style templates (20 per relation), the evaluation with QA-style prompts covers only one-fifth of the full prompt set in MyriadLAMA.
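For illustration, the sketch below assembles a QA-style few-shot prompt in the format described above; the helper name and the demonstration pairs are ours, while the instruction string follows this section.

```python
# Minimal sketch: build a QA-style few-shot prompt with the one-word instruction.
def build_qa_prompt(demos: list, query: str) -> str:
    lines = ["Answer each question in one word."]
    for question, answer in demos:                 # few-shot QA demonstrations
        lines += [f"Q: {question}", f"A: {answer}."]
    lines += [f"Q: {query}", "A:"]                 # the model continues after "A:"
    return "\n".join(lines)

demos = [("Who developed Swift?", "Apple"),
         ("Which country is Kyoto located in?", "Japan")]
print(build_qa_prompt(demos, "Who developed Python?"))
```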

D.2 Evaluation

We measure the ability of QA-style prompts to adhere to instructions and compare it to that of MP-style prompts. To ensure a fair comparison between QA- and MP-style ICL, we conduct evaluations on Llama2-7B using templates shared by both settings, i.e., 20 QA-style templates for each relation.

We evaluate fact prediction and one-word generation separately on Llama2-7B using average Acc@1 and the rate of one-word generation. As demonstrated in Table 4, Llama2-7B exhibits a remarkable capability to follow instructions for answering questions and generating one-word answers. We observe that QA-style instructions perform better under the zero-shot setting, likely because decoder-based PLMs naturally generate text autoregressively. However, this gap diminishes with the use of few-shot examples. This suggests that while MP-style prompts may slightly underestimate the knowledge in LLMs in zero-shot settings, MP-style ICL can achieve comparable or even superior factual knowledge prediction compared to QA-style ICL prompts.

ICL setting  | Acc@1 (QA) | Acc@1 (MP) | 1-word ratio (QA) | 1-word ratio (MP)
zero-shot    | .4534      | .5066      | .5285             | .4802
4-random     | .5429      | .5591      | .7996             | .8058
4-relation   | .6582      | .6649      | .9187             | .9246
4-template   | .6687      | .6765      | .9216             | .9266
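The two measures can be sketched as follows; the answer normalization (case-folding and stripping punctuation) is an assumed preprocessing choice, not necessarily the exact evaluation script.

```python
# Minimal sketch: Acc@1 and one-word ratio over (generation, gold object) pairs.
import string

def normalize(text: str) -> str:
    return text.strip().strip(string.punctuation).lower()

def evaluate(pairs: list) -> tuple:
    correct = one_word = 0
    for generation, gold in pairs:
        if len(generation.split()) == 1:               # one-word generation
            one_word += 1
        if normalize(generation) == normalize(gold):   # exact match after cleanup
            correct += 1
    return correct / len(pairs), one_word / len(pairs)

pairs = [("Nanjing.", "Nanjing"), ("the city of Havana", "Havana")]
print(evaluate(pairs))  # -> (0.5, 0.5)
```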

Appendix E Examples of BELIEF-ICL Prompts

In this section, we provide example prompts for the four patterns introduced in §3: zero-shot, X-random, X-relation, and X-template. We focus on examples where X equals 4, which is the primary setting used in our work.

E.1 zero-shot

Predict the [MASK] in each sentence in one word.
Q: [MASK] consists of LAUPT.
A:

E.2 4-random

Predict the [MASK] in each sentence in one word.
Q: [MASK] is the administrative center of Jiangsu.
A: Nanjing.
Q: Mar del Plata and [MASK] are sister cities that have been developing together.
A: Havana.
Q: Malawi has established diplomatic ties with [MASK].
A: Australia.
Q: Which country is House of Representatives located? [MASK].
A: Libya.
Q: [MASK] consists of LAUPT.
A:

E.3 4-relation

Predict the [MASK] in each sentence in one word.
Q: What is the overarching group for Panzer Division Kempf? [MASK].
A: Wehrmacht.
Q: To whom does Mount Bulusan relate? [MASK].
A: Luzon.
Q: Who is responsible for Army National Guard? [MASK].
A: National Guard.
Q: What group is pharmacy a part of? [MASK].
A: biology.
Q: [MASK] consists of environmental factors.
A:

E.4 4-template

Predict the [MASK] in each sentence in one word.
Q: [MASK] consists of Panzer Division Kempf.
A: Wehrmacht.
Q: [MASK] consists of Mount Bulusan.
A: Luzon.
Q: [MASK] consists of Army National Guard.
A: National Guard.
Q: [MASK] consists of pharmacy.
A: biology.
Q: [MASK] consists of environmental factors.
A:

Appendix F Experimental Details

In this section, we list detailed information about all PLMs used in our study, covering both encoder-based models and decoder-based LLMs.

F.1 Model cards

Here are the links from Hugging Face to load each model:

BERTbase:
BERTlarge:
BERTwwm:
ALBERTbase:
ALBERTlarge:
Llama2-7B:
Llama2-7B-IT:
Llama2-13B:
Llama2-13B-IT:
Llama2-70B:
Llama2-70B-IT:
Llama3-8B:
Llama3-8B-IT:
Llama3-70B:
Phi3-mini:
Phi3-small:
Phi3-medium:
Table 16: Pre-training details of the PLMs used in our study.
LLMs           | Architecture   | IT† | Model size | Pre-training corpus size | Pre-training corpus resource
BERTbase       | Encoder-based  | No  | 110M       | 3.3B words  | BookCorpus (11,038 unpublished books) and English Wikipedia (excluding lists, tables, and headers)
BERTlarge      | Encoder-based  | No  | 336M       | 3.3B words  | (same as above)
BERTwwm        | Encoder-based  | No  | 336M       | 3.3B words  | (same as above)
ALBERTbase     | Encoder-based* | No  | 11.8M      | 3.3B words  | (same as above)
ALBERTlarge    | Encoder-based* | No  | 223M       | 3.3B words  | (same as above)
Llama2-7B      | Decoder-based  | No  | 7B         | 2.0T tokens | Publicly available online data (excluding sites containing personal info; factual knowledge sources are upsampled)
Llama2-13B     | Decoder-based  | No  | 13B        | 2.0T tokens | (same as above)
Llama2-70B     | Decoder-based  | No  | 70B        | 2.0T tokens | (same as above)
Llama2-7B-IT   | Decoder-based  | Yes | 7B         | 2.0T tokens | (same as above)
Llama2-13B-IT  | Decoder-based  | Yes | 13B        | 2.0T tokens | (same as above)
Llama2-70B-IT  | Decoder-based  | Yes | 70B        | 2.0T tokens | (same as above)
Llama3-8B      | Decoder-based  | No  | 8B         | 15T+ tokens | Publicly available online data (details unknown; code is 4x larger than Llama2)
Llama3-8B-IT   | Decoder-based  | Yes | 8B         | 15T+ tokens | (same as above)
Llama3-70B     | Decoder-based  | No  | 70B        | 15T+ tokens | (same as above)
Phi3-mini      | Decoder-based  | Yes | 3.8B       | 4.9T tokens | High-quality materials including educational data, textbook-like generated text, and high-quality chats
Phi3-small     | Decoder-based  | Yes | 7B         | 4.9T tokens | (same as above)
Phi3-medium    | Decoder-based  | Yes | 14B        | 4.9T tokens | (same as above)
† Specifies whether the model is instruction-tuned.
* ALBERT shares parameters between token embeddings and transformer layers to reduce the number of parameters.

F.2 Model differences

We outline the differences between the PLMs in their pre-training details in Table 16, including the type of Transformer architecture, the model size, and the size and resources of the pre-training corpora.

F.3 Evaluation results on all PLMs based on BELIEFs

We present all evaluation results and their computational costs in this section. In Table 17, we report the full-scale experiments using all prompts provided by MyriadLAMA. These cover PLMs with 8B parameters or fewer: BERTbase, BERTlarge, BERTwwm, ALBERTbase, ALBERTlarge, Llama2-7B, Llama3-8B, Llama2-7B-IT, Llama3-8B-IT, Phi3-mini (3.8B), and Phi3-small (7B). For decoder-based models, we conduct experiments on four ICL settings.

For PLMs with more than 8B parameters, we report evaluation results using partial prompts from MyriadLAMA, specifically the manually rewritten templates (5 per relation), which amount to 1/20 of the prompts used in the full-scale experiments. We run these models on only two ICL settings: zero-shot and 4-template. To ensure a fair comparison with models having 8B parameters or fewer, we apply the same settings to all other decoder-based LLMs. The results are shown in Table 18. We also list the approximate runtime for each experiment in these two tables. All experiments are run on NVIDIA RTX 6000 Ada GPUs. For models with 8B parameters or fewer, we measure runtime on a single GPU; we use 2 GPUs for Llama2-13B and Phi3-small and 4 GPUs for the 70B models.

Furthermore, we display the calibration between accuracy and confidence for a straightforward inspection of the Ovconf metric. We show the calibration figures for the full-scale experiments in Figure 7 and for the experiments with partial prompts in Figure 8.

Table 17: BELIEF evaluation results with the full set of MyriadLAMA prompts (PLMs with 8B parameters or fewer) and approximate runtimes.
PLM              | ICL setting | Acc@1 | Fluctuation range↓ | Fluctuation SD↓ | Consist↑ | Ovconf | 1-word ratio↑ | Runtime
BERTbase         | --          | .1095 | .1534 | .0217 | .1682 | .2154  | N/A   | 6.3h
BERTlarge        | --          | .1102 | .1574 | .0220 | .1713 | .2052  | N/A   | 7.4h
BERTwwm          | --          | .1364 | .1517 | .0208 | .1524 | .1000  | N/A   | 7.4h
ALBERTbase       | --          | .0362 | .0668 | .0131 | .1333 | .1647  | N/A   | 6.1h
ALBERTlarge      | --          | .0974 | .1110 | .0148 | .0821 | .0553  | N/A   | 15.2h
Llama2-7B        | zero-shot   | .3385 | .2602 | .0299 | .1269 | -.1119 | .4752 | 46.4h
Llama2-7B        | 4-random    | .4816 | .2250 | .0270 | .2312 | -.0894 | .8247 | 47.8h
Llama2-7B        | 4-relation  | .6286 | .1221 | .0150 | .3753 | -.1335 | .9060 | 47.8h
Llama2-7B        | 4-template  | .6616 | .0294 | .0036 | .4163 | -.0933 | .9299 | 47.8h
Llama2-7B-IT     | zero-shot   | .2925 | .1980 | .0253 | .1151 | .2605  | .9069 | 46.4h
Llama2-7B-IT     | 4-random    | .4334 | .1958 | .0229 | .2128 | .2410  | .9081 | 47.8h
Llama2-7B-IT     | 4-relation  | .5576 | .0791 | .0092 | .3341 | .1900  | .9314 | 47.8h
Llama2-7B-IT     | 4-template  | .5896 | .0439 | .0050 | .3687 | .2061  | .9380 | 47.8h
Llama3-8B        | zero-shot   | .3427 | .2864 | .0350 | .0240 | -.1329 | .1572 | 44.9h
Llama3-8B        | 4-random    | .5205 | .2033 | .0273 | .2156 | -.0796 | .8147 | 82.1h
Llama3-8B        | 4-relation  | .6871 | .1236 | .0156 | .3659 | -.0783 | .9071 | 82.1h
Llama3-8B        | 4-template  | .7268 | .0220 | .0026 | .4015 | -.0582 | .9187 | 82.1h
Llama3-8B-IT     | zero-shot   | .3578 | .2213 | .0262 | .1660 | .1402  | .7925 | 44.9h
Llama3-8B-IT     | 4-random    | .4290 | .2068 | .0222 | .2137 | .1038  | .8511 | 82.1h
Llama3-8B-IT     | 4-relation  | .5727 | .0731 | .0092 | .3239 | .0760  | .9140 | 82.1h
Llama3-8B-IT     | 4-template  | .6508 | .0372 | .0040 | .3727 | .0800  | .9331 | 82.1h
Phi3-mini (3.8B) | zero-shot   | .3498 | .2374 | .0292 | .1465 | .1752  | .8641 | 30.7h
Phi3-mini (3.8B) | 4-random    | .4193 | .2324 | .0269 | .1649 | .1189  | .8184 | 32.9h
Phi3-mini (3.8B) | 4-relation  | .5686 | .1440 | .0164 | .2818 | .0755  | .8769 | 32.9h
Phi3-mini (3.8B) | 4-template  | .6067 | .0510 | .0048 | .3612 | .0887  | .8808 | 32.9h
Phi3-small (7B)  | zero-shot   | .4258 | .2437 | .0292 | .1782 | .2171  | .8883 | 82.4h
Phi3-small (7B)  | 4-random    | .4889 | .2170 | .0276 | .2070 | .1670  | .8913 | 148h
Phi3-small (7B)  | 4-relation  | .6339 | .1012 | .0129 | .3361 | .1252  | .9287 | 148h
Phi3-small (7B)  | 4-template  | .6612 | .0360 | .0043 | .3626 | .1279  | .9411 | 148h
Table 18: BELIEF evaluation results with partial prompts (manually rewritten templates, 5 per relation) and approximate runtimes.
ICL setting | PLM                | Acc@1↑ | Fluctuation range↓ | Fluctuation Stdev↓ | Consist↑ | Ovconf | 1-word ratio↑ | Runtime
zero-shot   | Phi3-mini (3.8B)   | .4248  | .1880 | .0247 | .2066 | .1609  | .8596 | 1.54h
zero-shot   | Phi3-small (7B)    | .4881  | .1900 | .0244 | .2284 | .1985  | .8996 | 4.12h
zero-shot   | Llama2-7B          | .4311  | .2014 | .0249 | .1932 | -.0922 | .5558 | 2.32h
zero-shot   | Llama2-7B-IT       | .3566  | .1862 | .0228 | .1932 | .2417  | .8961 | 2.32h
zero-shot   | Llama3-8B          | .4224  | .2820 | .0353 | .1269 | -.1438 | .1786 | 2.45h
zero-shot   | Llama3-8B-IT       | .4279  | .1962 | .0217 | .2337 | .1260  | .9179 | 2.45h
zero-shot   | Llama2-13B         | .4785  | .2131 | .0260 | .1437 | -.1673 | .3185 | 4.84h
zero-shot   | Llama2-13B-IT      | .4639  | .1701 | .0222 | .2358 | .2180  | .7542 | 4.84h
zero-shot   | Phi3-medium (14B)  | .5173  | .2123 | .0277 | .6167 | .2316  | .7759 | 4.85h
zero-shot   | Llama2-70B         | .5675  | .2126 | .0280 | .2598 | -.0988 | .6239 | 28.97h
zero-shot   | Llama2-70B-IT      | .5223  | .2055 | .0259 | .2489 | .1608  | .7891 | 28.97h
zero-shot   | Llama3-70B         | .5974  | .2137 | .0278 | .2290 | -.1438 | .7790 | 32.55h
4-template  | Phi3-mini (3.8B)   | .6106  | .0314 | .0039 | .3686 | .0911  | .9051 | 1.65h
4-template  | Phi3-small (7B)    | .6668  | .0306 | .0039 | .3666 | .1222  | .9413 | 7.40h
4-template  | Llama2-7B          | .6699  | .0257 | .0034 | .4174 | -.0933 | .9299 | 2.39h
4-template  | Llama2-7B-IT       | .6013  | .0368 | .0045 | .3629 | .2007  | .9372 | 2.39h
4-template  | Llama3-8B          | .7316  | .0194 | .0025 | .4060 | -.1119 | .9190 | 4.10h
4-template  | Llama3-8B-IT       | .6563  | .0252 | .0032 | .3752 | .0535  | .9315 | 4.10h
4-template  | Llama2-13B         | .7080  | .0235 | .0031 | .4326 | -.0662 | .9190 | 4.23h
4-template  | Llama2-13B-IT      | .6482  | .0301 | .0038 | .3656 | .1708  | .9341 | 4.23h
4-template  | Phi3-medium (14B)  | .7304  | .0207 | .0025 | .4009 | .0317  | .9350 | 3.88h
4-template  | Llama2-70B         | .7784  | .0190 | .0024 | .4449 | -.0690 | .9256 | 21.99h
4-template  | Llama2-70B-IT      | .7232  | .0258 | .0031 | .4226 | .1026  | .9582 | 21.99h
4-template  | Llama3-70B         | .8211  | .0139 | .0017 | .4636 | -.0812 | .9378 | 43.10h
[Figures 7 and 8: Calibration between accuracy and confidence for models evaluated with the full prompts (Figure 7) and with partial prompts (Figure 8).]

F.4 Knowledge coverage rate on all PLMs

We present the average, maximum, and upper-limit knowledge coverage rates, as introduced in §7, for all PLMs evaluated using all templates. The results are shown in Figure 19 and listed in the table below.

Knowledge coverage rates (average, maximum, and upper limit) for all PLMs evaluated with all templates.
PLMs        | ICL setting | Average | Maximum | Upper Limit
BERTbase    | --          | .1095   | .4248   | .6209
BERTlarge   | --          | .1102   | .4451   | .6556
BERTwwm     | --          | .1364   | .4501   | .6636
ALBERTbase  | --          | .0362   | .2175   | .3405
ALBERTlarge | --          | .0974   | .3746   | .5979
Llama2-7B   | zero-shot   | .3385   | .6577   | .8153
Llama2-7B   | 4-random    | .4816   | .7026   | .8587
Llama2-7B   | 4-relation  | .6286   | .7179   | .8475
Llama2-7B   | 4-template  | .6616   | .7197   | .8133
Llama3-8B   | zero-shot   | .3427   | .7099   | .8756
Llama3-8B   | 4-random    | .5205   | .7339   | .8867
Llama3-8B   | 4-relation  | .6871   | .7733   | .8934
Llama3-8B   | 4-template  | .7268   | .7731   | .8628
Phi3-mini   | zero-shot   | .3498   | .6346   | .8381
Phi3-mini   | 4-random    | .4193   | .6506   | .8423
Phi3-mini   | 4-relation  | .5686   | .6791   | .8436
Phi3-mini   | 4-template  | .6067   | .6754   | .8114
Phi3-small  | zero-shot   | .4258   | .6828   | .8642
Phi3-small  | 4-random    | .4889   | .7037   | .8695
Phi3-small  | 4-relation  | .6339   | .7172   | .8507
Phi3-small  | 4-template  | .6612   | .7181   | .8346
Examples of extended relational templates in MyriadLAMA (human-rewritten templates and their GPT-4 paraphrases).
ID | Human-rewritten templates | GPT-4 paraphrased templates
P19[X] started their life in [Y].[X] took their first steps of life in [Y].
[X] activated their life’s beginning in [Y].
[X] initiated their journey of life within [Y].
The birth of [X] occurred in [Y].The origin of [X] took place in [Y].
The inception of [X] was within [Y].
It was in [Y] that [X] first made its appearance.
P20[X] spent the last years of life in [Y].In [Y], [X] spent the end of their life.
[X]’s final era was in [Y].
In [Y], [X]’s life came to a close.
[Y] is the last place where [X] lived until death. [X] inhabited [Y] up until death.
[Y] was the end-of-life dwelling for [X].
[Y] served as the last dwelling for [X] before they died.
P279Of which class is [X] a subclass? [Y].What is the general class that [X] is a part of as a subclass? [Y].
What larger class encompasses [X] as a subclass? [Y].
Into which class is [X] categorized as a subclass? [Y].
[X] is also necessarily a [Y].[X] is intrinsically a [Y].
[X] is fundamentally a [Y].
[X] is by definition a [Y].
P37[Y] is spoken as an official language by people in [X]. [Y] is the authorized language for formal use in [X].
The official language spoken by individuals in [X] is [Y].
[X] endorses [Y] as the language for state-related communication.
Officially, the people living in [X] use the language [Y] for communication. In [X], the standard language for dialogue among the populace is [Y].
Residents of [X] typically converse in [Y].
The official medium of verbal exchange in [X] is the [Y] language.
P413[X] was given the [Y] job.[X] was selected for the [Y] position.
[X] was named the new [Y].
The [Y] duties have been allocated to [X].
[X] is a famous [Y] player.[X] has risen to fame with their [Y] playing abilities.
[X] is well-known for playing [Y].
[X] is notable for their expertise in [Y].
P449[X] premiered on the network [Y].[Y] was the origin of the broadcast for [X].
[X] was initially broadcasted by [Y].
The debut broadcast of [X] was on [Y].
[Y] is the first air channel of [X].[X] was originally brought to the public by [Y].
[X] first hit the airwaves courtesy of [Y].
[X] first reached listeners and viewers via [Y].
P47[X] and [Y] are neighboring countries.[X] and [Y] are countries that are in close proximity.
[Y] lies in the vicinity of [X].
[Y] and [X] are countries that share a boundary.
You can go through [X] to reach [Y].[X] acts as a gateway to [Y].
To reach [Y], one can travel through [X].
Traveling over [X] can bring you to [Y].
P138Who or what is [X] named after? [Y]. Who is the namesake behind [X]? [Y].
What is the etymology behind [X]’s name? [Y].
Who or what was [X] called after? [Y].
[X] is named after [Y].[X] takes its name from [Y].
[Y] is the inspiration behind the name of [X].
[X] holds the name given in tribute to [Y].
P364[X] is created in language [Y].[X] was composed in the [Y] language.
[X] unfolds in the language known as [Y].
[X] is expressed through the language of [Y].
[X] was written in the [Y] language. The words of [X] are in the [Y] language.
The composition of [X] is in the [Y] language.
[X] was created using the [Y] language.
P463[X] served for [Y].[X] took part in [Y].
[X] collaborated with [Y].
[X] held a position at [Y].
Which group or organization does [X] belong to? [Y]. [X] is part of what organization? [Y].
[X] is a member of which entity? [Y].
Can you tell me which entity [X] is a member of? [Y].
P101Which field does [X] work in? [Y]. In what industry is [X] employed? [Y].
[X] holds a position in which field? [Y].
[X] is a professional in what sector? [Y].
[X] is influential in the domain of [Y]. The domain of [Y] feels the considerable impact of [X].
[X] plays a pivotal role in the sphere of [Y].
[X] has a profound effect on [Y].
P106[X] is famous for achievements as a [Y]. [X] is well-known for their accomplishments in the [Y] role.
[X] is well-known for their successful career as a [Y].
[X] is a celebrated [Y] with a long list of achievements.
[X] is a [Y] by profession.[X] has built a career as a [Y].
[X] is employed as a [Y].
[X] carries out the role of a [Y].
P527[Y] is a member of [X].[X] contains [Y] as part of its composition.
[Y] holds a place in [X].
[Y] is a piece of [X].
[Y] belongs to [X].[Y] is held by [X].
[X] has [Y] under its ownership.
[Y] is one of the items owned by [X].
P530[Y] is one of the countries [X] has diplomatic relations with. [Y] is a member of the group of countries with which [X] conducts diplomacy.
[X] has a formal diplomatic relationship with [Y], as it does with several other countries.
[Y] is recognized by [X] as a diplomatic partner among other nations.
[X] has established diplomatic ties with [Y]. [X] has initiated formal diplomatic relations with [Y].
[X] and [Y] have begun a diplomatic relationship.
[X] and [Y] have set up official diplomatic links.
P176[X] is a product of [Y]’s manufacturing.The entity [Y] crafts and produces [X].
The item [X] is fabricated by [Y].
[X] is brought to life by [Y]’s manufacturing capabilities.
Which company produced [X]? [Y].Can you tell me who made [X]? [Y].
Which producer can be linked to [X]? [Y].
What is the producing company of [X]? [Y].
P27[X] is a person from country [Y].[X] is a resident of [Y].
[X] bears the nationality of [Y].
[X] is a product of [Y].
The nationality of [X] is [Y].[X] is a native of [Y].
[X] is identified as a national from [Y].
[Y] is the country of origin for [X].
P407[X] is in language [Y].The language of [X] is [Y].
The primary linguistic expression of [X] is in [Y].
[X] is articulated through the [Y] language.
[X] is a work in the [Y] language.The [Y] language is the linguistic fabric of [X].
[X] has been produced using the [Y] language.
[X] is an example of literature in the [Y] language.
P30On what continent is [X] located? [Y].What’s the name of the continent that [X] calls home? [Y].
What continental landmass does [X] occupy? [Y].
[X] lies on which of the Earth’s continents? [Y].
[X] is a part of the continent [Y].[X] is a section of the continental land of [Y].
[X] is geographically positioned as part of continent [Y].
[X] is an integral piece of the continent [Y].
P178[X] was originally created by [Y].The foundation of [X] was laid by [Y].
The concept of [X] was conceived by [Y].
[X] first came into existence thanks to [Y].
[X] is developed by [Y].[Y] has developed [X].
[Y] is the developer behind [X].
[Y] stands as the creator of [X].
P1376[X] is the capital of [Y].[Y]’s governmental seat is in [X].
[X] is recognized as the official capital of [Y].
The leading city and capital of [Y] is [X].
[X] is the administrative center of [Y].[Y]’s administrative leadership is situated in [X].
[Y]’s administrative affairs are managed from [X].
[X] is where [Y]’s administrative management is anchored.
P131[Y] is the place where [X] is located.[X] resides in [Y].
[X] can be found at the location of [Y].
[X] is anchored in [Y].
[X] is located in [Y].[Y] is where [X] is established.
[Y] contains [X].
[Y] houses [X].
P1412What language does [X] use? [Y].[X] communicates in what vernacular? [Y].
What tongue does [X] utilize? [Y].
What is the primary language for [X]? [Y].
[Y] is the language that is used by [X].The tongue of [X] is the language [Y].
[X] uses [Y] as its mode of speech.
[Y] is the language that enables communication for [X].
P108[X] is employed by [Y].[Y] is the employer of [X].
[X] has a job at [Y].
[Y] is the source of employment for [X].
Who does [X] work for? [Y].Who does [X] report to in their job? [Y].
For whom is [X] currently working? [Y].
Who holds [X] on their team? [Y].
P136What is the genre of [X]? [Y]. In terms of genre, how would you classify [X]? [Y].
What category of genre does [X] belong to? [Y].
In what genre category would you place [X]? [Y].
[X] is the representative of the [Y] style.[X] personifies the [Y] style in its purest form.
[X] is the epitome of the [Y] approach.
[X] is the archetype of the [Y] tradition.
P17Which country is [X] located? [Y].Can you identify the country where [X] is situated? [Y].
Could you specify the country of [X]’s location? [Y].
[X] can be located in what country? [Y].
[Y] is the country in which [X] is located.[Y] is the nation that houses [X].
[Y] encompasses the region where [X] can be found.
[Y] is the setting for the location of [X].
P39What position does [X] hold? [Y].What position does [X] occupy? [Y].
What is the employment status of [X]? [Y].
What is the position title for [X]? [Y].
[X] was sworn in as [Y].[X] has been designated the official role of [Y].
[X] pledged their commitment to the role of [Y].
[X] was confirmed in the role of [Y].
P264Which music label represents [X]? [Y].Which label has [X] on its roster? [Y].
Who is [X]’s music label? [Y].
With whom is [X] signed for music production? [Y].
[X] is represented by music label [Y].The music label acting on behalf of [X] is [Y].
[Y] is the music label that has signed [X].
[X] has music label [Y] as its representative.
P276Where is [X] located? [Y].What’s the location of [X]? [Y].
Where can [X] be found? [Y].
Where should I look for [X]? [Y].
[X] is located in [Y].[X] is positioned in [Y].
[X] occupies a space in [Y].
[Y] contains [X].
P937[Y] is the place where [X] worked.[X] had their employment based in [Y].
[X] found their employment setting in [Y].
[X] conducted their professional activities in [Y].
[X] had work activity in [Y].[X] took part in business tasks in [Y].
[X] was employed within the confines of [Y].
[X] was operational in the workforce at [Y].
P140Which religion is [X] affiliated with? [Y].What religious belief does [X] adhere to? [Y].
Which spiritual path is embraced by [X]? [Y].
What is the creed of [X]? [Y].
[X] is affiliated with the [Y] religion.[X] is part of the [Y] religious denomination.
[X] is associated with the [Y] spiritual tradition.
[X] adheres to the [Y] religion.
P1303[X] is a [Y] player.[X] specializes in the [Y].
[X] is a seasoned [Y] player.
[X] is a [Y] specialist.
[X] plays [Y].[X] expresses their musicianship through [Y].
[X] has chosen [Y] as their musical companion.
[X] is a musician who specializes in [Y].
P127Who owns [X]? [Y].Whose property is [X] considered to be? [Y].
Who is the legal holder of [X]? [Y].
Who has the ownership rights to [X]? [Y].
[X] is owned by [Y].[Y] is the proprietor of [X].
[Y] holds the title to [X].
[Y] possesses [X].
P103[X] grew up speaking [Y] as their first language. [X]'s formative years were shaped by speaking [Y].
[X] started their life speaking [Y].
[X]’s childhood language was [Y].
[Y] is the mother tongue of [X].[X] has [Y] as their original tongue.
[X] was nurtured in an environment where [Y] is spoken.
[X] has [Y] as the language of their upbringing.
P190The city of [X] is twinned with [Y].[Y] and [X] have entered into a twinning arrangement.
[X] is in a twinning relationship with [Y].
A twinning link has been established between [X] and [Y].
[X] and [Y] are sister cities that have been developing together. [X] and [Y] have been sister cities on a shared developmental journey.
The cities of [X] and [Y] have jointly progressed as sister municipalities.
[X] and [Y] have been in lockstep as sister cities in their development.
P1001[X] applies to the jurisdiction in [Y].The jurisdiction of [Y] encompasses [X].
[X] is answerable to the legal system in [Y].
[Y] exercises legal control over [X].
The region of [Y] uses [X] as a legal term.[X] is a term with legal standing in [Y].
The legal system of [Y] includes [X] as an official term.
[X] is employed as a juridical term in [Y].
P31[X] is a [Y].[X] represents a [Y].
[X] is an example of a [Y].
[X] is termed a [Y].
Speaking of [Y], [X] is an example of it.[X] is a particular instance that reflects [Y].
[X] is a variant that falls within the scope of [Y].
[Y] can be demonstrated through [X].
P495[X] originates from the country of [Y].[X] was first found in the land of [Y].
The inception of [X] is linked to the country [Y].
The origin of [X] can be traced back to [Y].
[X] first appeared in [Y].[X] has its roots in [Y].
[X] was first crafted in [Y].
The origin of [X] is attributed to [Y].
P159The operation of [X] depends on the headquarters in [Y]. [X]'s functioning is reliant on the main office in [Y].
The base in [Y] is essential for [X] to function.
The primary operations of [X] are contingent upon the base in [Y].
The headquarters of [X] is in [Y].[Y] is home to the central office of [X].
The top office of [X] is positioned in [Y].
The nerve center for [X]’s operations is based in [Y].
P36[Y] is the administrative center of [X].The nerve center for [X]’s administration is found in [Y].
[X]’s administrative governance is centralized in [Y].
[X] is administratively governed by [Y].
[Y] represents the capital city for [X].[Y] functions as [X]’s political hub.
[X] uses [Y] as its head city.
[X]’s administrative center is [Y].
P740[X] started their career in [Y].[Y] served as the starting point for [X]’s career.
[X] began earning their stripes in the field of [Y].
[X] commenced their employment journey with [Y].
The formation location of [X] is [Y].The assembly point for [X] is [Y].
[Y] is recognized as the setting for [X]’s formation.
[Y] is where [X] originates.
P361Which entity does [X] belong to? [Y].Who owns [X]? [Y].
What is the overarching group for [X]? [Y].
What organization encompasses [X]? [Y].
[Y] consists of [X].[X] is what [Y] is primarily made of.
[Y] incorporates [X] within it.
[Y] is structured with [X].