Modelling Commonsense Commonalities with Multi-Facet Concept Embeddings

Hanane Kteich1, Na Li3, Usashi Chatterjee2, Zied Bouraoui1, Steven Schockaert2
1 CRIL CNRS & University of Artois, France  2 CardiffNLP, Cardiff University, UK
3 University of Shanghai for Science and Technology, China
{kteich,bouraoui}@cril.fr, {chatterjee,schockaerts1}@cardiff.ac.uk,
li_na@usst.edu.cn

Abstract

Concept embeddings offer a practical and efficient mechanism for injecting commonsense knowledge into downstream tasks. Their core purpose is often not to predict the commonsense properties of concepts themselves, but rather to identify commonalities, i.e. sets of concepts which share some property of interest. Such commonalities are the basis for inductive generalisation, hence high-quality concept embeddings can make learning easier and more robust. Unfortunately, standard embeddings primarily reflect basic taxonomic categories, making them unsuitable for finding commonalities that refer to more specific aspects (e.g. the colour of objects or the materials they are made of). In this paper, we address this limitation by explicitly modelling the different facets of interest when learning concept embeddings. We show that this leads to embeddings which capture a more diverse range of commonsense properties, and consistently improves results in downstream tasks such as ultra-fine entity typing and ontology completion.



1 Introduction

Many knowledge engineering tasks require knowledge about the meaning of concepts. As a motivating example, let us consider the problem of ontology expansion, which consists in uncovering properties of, and relationships between, concepts, given the names of these concepts and an initial knowledge base. Despite the popularity of Large Language Models (LLMs), the use of concept embeddings remains attractive in such settings (Vedula et al., 2018; Li et al., 2019; Malandri et al., 2021; Shi et al., 2023). Indeed, using LLMs directly is often impractical and highly inefficient, as ontologies can involve tens of thousands of concepts. Concept embeddings can also be integrated with structural features more easily (Li et al., 2019), for instance by using them to initialise the node embeddings of a Graph Neural Network (GNN). Concept embeddings similarly play an important role in many multi-label classification tasks, especially in the zero-shot or few-shot regime (Xing et al., 2019; Yan et al., 2021; Luo et al., 2021; Huang et al., 2022; Ma et al., 2022). As a representative example of such a task, we will consider the problem of ultra-fine entity typing (Choi et al., 2018), which consists in assigning semantic types to mentions of entities, where a total of around 10K candidate types are considered. In such tasks, the role of pre-trained concept embeddings is to inject prior knowledge about the meaning of the type labels (Xiong et al., 2019; Li et al., 2023a). Note that we cannot straightforwardly accomplish this with LLMs, as they have been found to struggle with information extraction tasks (Han et al., 2023). Moreover, scalability is often an important concern for information extraction systems, which further complicates the use of LLMs.

We take the view that concept embeddings, in the aforementioned applications, are primarily needed to capture commonalities among the concepts involved. For ontology expansion, this is true by definition, since the task explicitly involves identifying sets of concepts that have some property in common. For ultra-fine entity typing, Li et al. (2023a) reported that directly using pre-trained label embeddings was challenging. Instead, they proposed to cluster the set of labels based on pre-trained concept embeddings, and to use the resulting clusters to structure the label space. The idea of using embeddings to structure the label space also lies at the heart of many traditional approaches for zero-shot and few-shot classification.

A key limitation of traditional concept embeddings comes from the fact that they primarily reflect basic taxonomic categories. For instance, the embedding of banana is typically similar to that of other fruits, but dissimilar from the embeddings of other yellow things. Some authors have proposed to learn multi-facet embeddings as a way of alleviating these concerns (Rothe and Schütze, 2016; Jain et al., 2018; Alshaikh et al., 2019, 2020). Essentially, rather than learning a single vector representation of each concept (or entity), they learn a fixed number of different vectors, each focusing on a different facet. However, learning such representations is challenging for two main reasons. First, learning multi-facet representations requires some kind of supervision signal about the facets of interest (Locatello et al., 2019), which is not readily available for many domains. Second, existing approaches consider a fixed set of facets, which makes them unsuitable for open-domain settings. Indeed, the facets of interest strongly depend on the nature of the concepts involved. When modelling food, we may be interested in embeddings that capture their nutritional content. When modelling household appliances, we may want a representation that captures where in the house they are typically found. Rather than using the same set of facets for all concepts, we thus need a more dynamic representation framework.

In this paper, we propose a novel method for learning multi-facet concept embeddings based on two key ideas. First, we rely on ChatGPT (https://openai.com) to collect a diverse set of (property, facet) pairs, such as (yellow, colour), (found in the kitchen, location) or (sweet, taste), allowing us to treat the problem of learning multi-facet embeddings as a supervised learning problem. Second, rather than learning several independent vector representations, we only learn a single embedding for each concept, treating facets instead as masks on the set of coordinates. This approach offers several modelling advantages, including the fact that facets can have a hierarchical structure (e.g. colour is a sub-facet of appearance) and the fact that we do not have to tune the number and dimension of the facets a priori. Specifically, we train three BERT (Devlin et al., 2019) encoders: one encoder to map concepts onto their embedding, one encoder to map properties onto their embedding, and one encoder to map properties onto the embedding of the corresponding facet. We show that these encoders can be effectively trained using only training data obtained from ChatGPT, although the best results are obtained by augmenting this training data with examples from ConceptNet (https://conceptnet.io).

2 Related Work

Concept Embeddings

The idea that language models of the BERT family can be used for learning concept embeddings has been studied extensively. Some approaches simply use the name of the concept as input to the BERT encoder, possibly together with a short prompt (Bommasani et al., 2020; Vulić et al., 2021; Liu et al., 2021a; Gajbhiye et al., 2022). Other approaches instead use contextualised representations from sentences mentioning the concept, selected from some corpus (Ethayarajh, 2019; Bommasani et al., 2020; Vulić et al., 2020; Liu et al., 2021b; Li et al., 2023b). These approaches have been developed with different motivations in mind. One common motivation is to learn something about the language model itself by inspecting the resulting concept embeddings, such as biases (Bommasani et al., 2020) or the model's grasp of lexical semantics (Ethayarajh, 2019; Vulić et al., 2020). Other authors have rather focused on the use of embeddings for predicting semantic properties of concepts (Gajbhiye et al., 2022; Li et al., 2023b; Rosenfeld and Erk, 2023). Our paper can be seen as a continuation of this latter research line, where we aim to improve the range of properties that can be captured by concept embeddings through the use of facet embeddings.

Commonalities

Gajbhiye et al. (2023) recently argued that the main purpose of concept embeddings, when it comes to downstream applications, is usually to identify what different concepts have in common. Specifically, given a set of concepts, they first used the corresponding concept embeddings to predict a set of properties for each concept. The resulting predictions were then filtered using a Natural Language Inference (NLI) model. Finally, properties that were found for at least two concepts were identified as shared properties. They showed, on the task of ultra-fine entity typing, that by augmenting the training data with these shared properties, models were able to generalise better. This idea also relates to the notion of conceptualisation (He et al., 2022; Wang et al., 2023b). Essentially, the latter works have suggested to augment commonsense knowledge graphs by generalising the concepts involved. This often involves replacing a specific concept (e.g. a football game) by a description referring to some salient property (e.g. a relaxing event). Wang et al. (2023a) showed that the resulting generalisations of commonsense knowledge graphs were useful for zero-shot commonsense question answering. The aforementioned methods all rely on the availability of a set of properties (or hypernyms) that can be used to generalise a given set of concepts. In practice, however, it is hard to obtain comprehensive property sets, which means that many commonalities may not be discovered. Moreover, certain commonalities are hard to describe, even though they intuitively make sense: as a simple toy example, among the set {cat, dog, goldfish, rabbit}, the concepts cat and dog stand out as similar, even though they are not the only pets nor the only mammals. To avoid such limitations, we identify commonalities by clustering concept embeddings.

Multi-Facet Embeddings

The idea of capturing different facets of meaning has been studied in the context of disentangled representation learning, especially in computer vision (Chen et al., 2016; Higgins et al., 2017; Kim and Mnih, 2018; Chen et al., 2018). When it comes to learning disentangled representations of text, He et al. (2017) proposed a method for learning aspect embeddings in the context of sentiment analysis, whereas several authors have proposed multi-facet document embeddings (Jain et al., 2018; Risch et al., 2021; Kohlmeyer et al., 2021). Rothe and Schütze (2016) suggested that word embeddings could be decomposed into meaningful subspaces, which essentially correspond to facets. Most similar to our work, Alshaikh et al. (2019) proposed a method for decomposing a domain-specific concept embedding space into subspaces capturing different facets. To find this decomposition, they relied on the assumption that properties belonging to the same facet tend to have similar word embeddings. Finally, Alshaikh et al. (2020) proposed a mixture-of-experts model to learn multi-facet concept embeddings directly, using a variant of GloVe (Pennington et al., 2014).

3 Proposed Approach

We propose a bi-encoder based concept embedding model which is capable of representing concepts w.r.t. different facets. The key stumbling block in previous work on learning multi-facet embeddings has been the difficulty in acquiring a meaningful supervision signal about which properties belong to the same facet (e.g. that large and small both refer to size). As explained in Section 3.1, we can now overcome this difficulty by collecting property-facet pairs from LLMs. Our proposed model itself is described in Section 3.2. Finally, we explain how facet-specific embeddings can be extracted once the model has been trained (Section 3.3).

3.1 Obtaining Training Data

We need two types of training examples for our model: concept-property judgements (e.g. banana has the property rich in potassium) and property-facet judgements (e.g. rich in potassium refers to nutritional content). We use two sources for obtaining these examples: ChatGPT and ConceptNet.

ChatGPT

We use the dataset of 109K concept-property judgements that were obtained from ChatGPT by Chatterjee et al. (2023); the dataset is available from https://github.com/ExperimentsLLM/EMNLP2023_PotentialOfLLM_LearningConceptualSpace. To obtain property-facet pairs, we proceed in a similar way, although obtaining suitable information about facets turned out to be more challenging. We obtained the best results with the following prompt, which does not ask about facets explicitly. Instead, we ask about concept-property pairs, but use a format which requires the model to specify the facet of each property that is generated:

I am interested in knowing common properties that are satisfied by different concepts.
1. Sound: loud - thunder, jet engine, siren
2. Temperature: cold - ice, refrigerator, Antarctica
3. Colour: orange - mandarin, basketball, clownfish
4. Shape: round - sun, orange, ball
5. Purpose: used for cleaning - broom, lemon, soap
6. Location: located in the ocean - sand, whale, corals.
Please provide me with a list of 30 such examples.

We issued this request with the same prompt 10 times. After this, we changed the examples that are given (the example concepts listed in the prompt above) and repeated the process. We manually processed the responses to standardise facet spellings and removed duplicates. For instance, facets were sometimes generated in the plural (e.g. colors rather than color), or the same facet was generated with different spellings (e.g. color and colour). Even when changing the examples in the prompt after every 10 requests, the number of unique facet-property pairs that were generated saturated relatively quickly. In total, we obtained 828 unique facet-property pairs, covering 127 unique facets.

ConceptNet

Starting from a ConceptNet 5 dump (https://github.com/commonsense/conceptnet5/wiki/Downloads), we first selected the English-language triples. Given a triple such as (boat, at location, sea), we create a corresponding concept-property pair (boat, at location sea) and a property-facet pair (at location sea, at location). In other words, the ConceptNet relations are treated as facets, and properties are obtained by combining a relation with a tail concept. Not all ConceptNet relations are suitable for this purpose. We specifically used: RelatedTo, FormOf, IsA, UsedFor, AtLocation, CapableOf, HasProperty, HasA, InstanceOf and MadeOf. Furthermore, when creating properties, we only consider tail concepts that appear at least 10 times. We thus end up with 884 distinct properties, 10 facets, 18,505 concepts, 884 property-facet pairs and 36,955 concept-property pairs.
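
To make the construction concrete, the following sketch shows how such pairs could be derived from ConceptNet triples. It is a minimal Python illustration rather than our released preprocessing code; the `triples` input format and the interpretation of the 10-occurrence threshold (counting tail concepts globally) are assumptions.

```python
# Minimal sketch (not the released preprocessing script): deriving (concept, property)
# and (property, facet) pairs from English ConceptNet triples.
from collections import Counter

KEPT_RELATIONS = {"RelatedTo", "FormOf", "IsA", "UsedFor", "AtLocation",
                  "CapableOf", "HasProperty", "HasA", "InstanceOf", "MadeOf"}

def build_pairs(triples, min_tail_count=10):
    """triples: iterable of (head, relation, tail) strings from English ConceptNet."""
    triples = [t for t in triples if t[1] in KEPT_RELATIONS]
    tail_counts = Counter(tail for _, _, tail in triples)  # assumption: tails counted globally
    concept_property, property_facet = set(), set()
    for head, rel, tail in triples:
        if tail_counts[tail] < min_tail_count:
            continue  # only keep tail concepts that appear at least 10 times
        prop = f"{rel} {tail}"               # e.g. (boat, AtLocation, sea) -> "AtLocation sea"
        concept_property.add((head, prop))   # (boat, "AtLocation sea")
        property_facet.add((prop, rel))      # the relation itself serves as the facet
    return concept_property, property_facet
```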

3.2 Model Formulation

Let us write $\mathcal{D}_{\mathsf{cp}}$ for the set of (concept, property) pairs that are available for training. Similarly, we write $\mathcal{D}_{\mathsf{pf}}$ for the set of available (property, facet) pairs. We build on the following bi-encoder loss from Gajbhiye et al. (2022):

$$\mathcal{L} = -\sum_{(c,p)\in\mathcal{D}_{\mathsf{cp}}}\log\sigma\big(\mathsf{Con}(c)\cdot\mathsf{Prop}(p)\big) \;-\; \sum_{(c,p)\in\mathcal{N}_{\mathsf{cp}}}\log\big(1-\sigma\big(\mathsf{Con}(c)\cdot\mathsf{Prop}(p)\big)\big)$$

where $\mathcal{N}_{\mathsf{cp}}$ is a set of negative examples. Specifically, for each positive example $(c,p)$, five negative examples $(c,p')$ are obtained by replacing $p$ by another property $p'$. The concept embedding $\mathsf{Con}(c)$ and property embedding $\mathsf{Prop}(p)$ are obtained by two separate BERT encoders. The concept encoder $\mathsf{Con}$ uses a prompt of the form "<Concept> means [MASK]". The property encoder $\mathsf{Prop}$ uses the same prompt. In both cases, the embeddings are obtained from the final-layer embedding of the [MASK] token. However, the concept embedding $\mathsf{Con}(c)$ is normalised (w.r.t. the Euclidean norm), whereas the property embedding $\mathsf{Prop}(p)$ is not.
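
The following sketch illustrates how such prompt-based encoders could be implemented with Hugging Face Transformers. It is an illustration of the setup described above, not our actual code; the class name `MaskEncoder` and the handling of normalisation for the facet encoder are assumptions.

```python
# Illustrative bi-encoder sketch: each phrase is encoded via the final-layer
# [MASK] embedding of the prompt "<phrase> means [MASK]."
import torch
from transformers import AutoTokenizer, AutoModel

class MaskEncoder(torch.nn.Module):
    def __init__(self, model_name="bert-large-uncased", normalise=False):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)
        self.normalise = normalise

    def forward(self, phrases):
        prompts = [f"{p} means {self.tokenizer.mask_token}." for p in phrases]
        batch = self.tokenizer(prompts, padding=True, truncation=True,
                               max_length=32, return_tensors="pt")
        hidden = self.bert(**batch).last_hidden_state               # (B, L, d)
        mask_pos = batch["input_ids"] == self.tokenizer.mask_token_id
        emb = hidden[mask_pos]                                       # (B, d): one [MASK] per prompt
        if self.normalise:
            emb = torch.nn.functional.normalize(emb, dim=-1)         # L2-normalise
        return emb

con_encoder = MaskEncoder(normalise=True)    # Con: normalised concept embeddings
prop_encoder = MaskEncoder(normalise=False)  # Prop: unnormalised property embeddings
facet_encoder = MaskEncoder(normalise=False) # Facet: the third encoder introduced below
                                             # (whether Facet is normalised is an assumption)
```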

Previous work on multi-facet embeddings has relied on learning multiple concept embeddings, where each concept embedding only captures a subset of all properties (Alshaikh et al., 2020). This approach has a number of drawbacks, however. For instance, it relies on the idea that each facet is represented using the same number of dimensions, and it implicitly assumes that the overall number of facets is relatively small and that the facets are independent from each other. This is particularly problematic in open-domain settings, where a wide range of facets may need to be considered, certain facets only make sense for some concepts (e.g. nutritional value only makes sense for food) and facets often have a hierarchical structure (e.g. colour is a sub-facet of appearance). Therefore, instead of learning multiple concept embeddings, we interpret facets as masks on concept embeddings.

Specifically, we train a third BERT encoder, $\mathsf{Facet}$, which also takes the property $p$ as input and again uses the same prompt. The idea is that $\mathsf{Facet}(p)$ indicates which coordinates of the concept embeddings are most relevant when modelling the property $p$. We define the masked embedding of concept $c$ w.r.t. some property $p$ as follows:

$$\mathsf{MC}(c,p)=\frac{\mathsf{Con}(c)\odot\mathsf{Facet}(p)}{\|\mathsf{Con}(c)\odot\mathsf{Facet}(p)\|}$$

where we write $\odot$ for the component-wise product. We essentially keep the same bi-encoder model, but instead rely on these masked concept embeddings:

$$\mathcal{L}_1 = -\sum_{(c,p)\in\mathcal{D}_{\mathsf{cp}}}\log\sigma\big(\mathsf{MC}(c,p)\cdot\mathsf{Prop}(p)\big) \;-\; \sum_{(c,p)\in\mathcal{N}_{\mathsf{cp}}}\log\big(1-\sigma\big(\mathsf{MC}(c,p)\cdot\mathsf{Prop}(p)\big)\big)$$
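
A minimal PyTorch sketch of the masked concept embedding and of $\mathcal{L}_1$ is given below. It is illustrative only: it assumes the per-example embeddings have already been stacked into batches and it uses a mean rather than a sum reduction.

```python
# Illustrative sketch of MC(c, p) and the loss L1.
import torch
import torch.nn.functional as F

def masked_concept(con_emb, facet_emb, eps=1e-9):
    """Component-wise product of concept embedding and facet mask, then L2-normalisation."""
    masked = con_emb * facet_emb
    return masked / (masked.norm(dim=-1, keepdim=True) + eps)

def loss_l1(con_emb, facet_emb, prop_emb, labels):
    """labels[i] = 1 for positive (c, p) pairs, 0 for the five sampled negatives per positive."""
    mc = masked_concept(con_emb, facet_emb)
    logits = (mc * prop_emb).sum(dim=-1)        # dot products MC(c, p) . Prop(p)
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```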

Without further supervision, the facet encoder does not learn meaningful facets. Therefore, we use the (property, facet) examples from $\mathcal{D}_{\mathsf{pf}}$ to ensure that properties which belong to the same facet have a similar facet embedding. For a given facet $f$, we write $\mathcal{P}_f$ for the set of properties that we know to belong to this facet, i.e. $\mathcal{P}_f=\{p \mid (p,f)\in\mathcal{D}_{\mathsf{pf}}\}$. We use the InfoNCE loss:

$$\mathcal{L}_2 = -\sum_{f}\sum_{p,q\in\mathcal{P}_f}\log\frac{\exp\left(\frac{\cos(\mathsf{F}(p),\mathsf{F}(q))}{\tau}\right)}{\sum_{r}\exp\left(\frac{\cos(\mathsf{F}(p),\mathsf{F}(r))}{\tau}\right)}$$

where we abbreviate $\mathsf{Facet}(p)$ as $\mathsf{F}(p)$, the sum in the denominator ranges over $r\in\{q\}\cup\{p' \mid (p',f)\notin\mathcal{D}_{\mathsf{pf}}\}$, and the temperature $\tau>0$ is a hyperparameter. The InfoNCE loss encourages properties which belong to the same facet to have facet embeddings that are more similar to each other than to the facet embeddings of properties which do not. The overall model is trained by optimising the loss $\mathcal{L}_1+\mathcal{L}_2$.
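
The facet-level loss can be implemented as a standard InfoNCE objective over sampled negatives, as in the following sketch. The batching of positive pairs and negatives, and the placeholder temperature value, are assumptions.

```python
# Illustrative InfoNCE sketch for L2: anchors and positives share a facet; negatives
# are (sampled) properties not linked to that facet in D_pf.
import torch
import torch.nn.functional as F

def loss_l2(anchor_facets, positive_facets, negative_facets, tau=0.05):
    """anchor_facets, positive_facets: (B, d); negative_facets: (B, K, d)."""
    a = F.normalize(anchor_facets, dim=-1)      # cosine similarity via normalised dot products
    pos = F.normalize(positive_facets, dim=-1)
    neg = F.normalize(negative_facets, dim=-1)
    pos_sim = (a * pos).sum(-1, keepdim=True) / tau        # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", a, neg) / tau     # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1)          # positive pair in column 0
    targets = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, targets)
```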

3.3 Extracting Facet-Specific Representations

The model from Section 3.2 can be used in several ways. First, we can simply use the concept embeddings $\mathsf{Con}(c)$ to represent each concept $c$. In this case, the purpose of having facets is to ensure that the concept embeddings capture a broader range of properties, but we only consider these facets during training. We will refer to this approach as ConEmb-F. The concept embeddings from the standard bi-encoder, without facet embeddings, will be referred to as ConEmb.

In some applications, concept embeddings are used for clustering concepts. The purpose of multi-facet embeddings is to ensure that different kinds of clusters can be found. In such settings, we extract different facet-specific concept embeddings from the model. Specifically, let $\mathcal{P}=\{p_1,...,p_m\}$ be the set of properties of interest. For each property $p_i$ we have a corresponding facet vector $\mathbf{f}_i=\mathsf{Facet}(p_i)$. We use the K-means algorithm to cluster these facet vectors into clusters $\mathcal{X}_1,...,\mathcal{X}_k$ and treat each of these clusters as a facet. We associate each concept $c$ with $k$ facet-specific representations $\mathbf{c}_1,...,\mathbf{c}_k$, defined as:

$$\mathbf{c}_j=\frac{\mathsf{Con}(c)\odot\big(\sum_{p_i\in\mathcal{X}_j}\mathbf{f}_i\big)}{\big\|\mathsf{Con}(c)\odot\big(\sum_{p_i\in\mathcal{X}_j}\mathbf{f}_i\big)\big\|} \tag{1}$$

The representations obtained by this approach depend on how the set of properties $\mathcal{P}$ is chosen. For our experiments, we simply set $\mathcal{P}$ to be the set of all properties that appear in our training set.
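
The following sketch shows how the facet-specific representations of Equation (1) could be computed from pre-computed embeddings, using K-means with $k=10$ as in our experiments. The NumPy/scikit-learn formulation is illustrative rather than our exact implementation.

```python
# Illustrative sketch of Equation (1): cluster the facet vectors of the training
# properties with K-means and mask the concept embeddings per cluster.
import numpy as np
from sklearn.cluster import KMeans

def facet_specific_embeddings(con_emb, facet_vecs, k=10):
    """con_emb: (n_concepts, d) Con(c); facet_vecs: (n_properties, d) Facet(p_i) for P."""
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(facet_vecs)
    reps = []
    for j in range(k):
        mask = facet_vecs[clusters == j].sum(axis=0)        # sum of facet vectors in X_j
        masked = con_emb * mask                              # component-wise product
        norms = np.linalg.norm(masked, axis=1, keepdims=True)
        reps.append(masked / np.clip(norms, 1e-9, None))     # c_j for every concept
    return reps  # list of k arrays, each of shape (n_concepts, d)
```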

4 Experiments

We analyse the effectiveness of the proposed multi-facet concept embedding model. We intrinsically evaluate the embeddings on predicting commonsense properties (Section 4.1) and outlier detection (Section 4.2). We furthermore consider two downstream applications: ontology completion (Section 4.3) and ultra-fine entity typing (Section 4.4). Our datasets, pre-trained models and implementation are available at https://github.com/hananekth/facets_concepts_embeddings.

Table 1: Commonsense property prediction results on the extended McRae and CSLB benchmarks (Con = concept split, Prop = property split, C+P = concept+property split). LM: BB = BERT-base, BL = BERT-large, RB = RoBERTa-base, RL = RoBERTa-large. CN = ConceptNet; MSCG = the Microsoft Concept Graph / GenericsKB training data used by Gajbhiye et al. (2022).

| Model | LM | Train properties ($\mathcal{D}_{\mathsf{cp}}$) | Train facets ($\mathcal{D}_{\mathsf{pf}}$) | McRae Con | McRae Prop | McRae C+P | CSLB Con | CSLB Prop | CSLB C+P |
| BiEnc | BB | MSCG | - | 79.8 | 49.6 | 44.5 | 54.5 | 39.1 | 32.6 |
| BiEnc | BL | MSCG | - | 80.5 | 49.3 | 45.5 | 57.7 | 41.8 | 36.4 |
| BiEnc | RB | MSCG | - | 75.6 | 42.4 | 38.1 | 49.9 | 36.4 | 24.3 |
| BiEnc | RL | MSCG | - | 80.1 | 46.5 | 42.5 | 59.0 | 42.5 | 36.0 |
| BiEnc | BL | CN | - | 78.0 | 56.7 | 51.8 | 61.4 | 49.6 | 50.0 |
| BiEnc | BL | ChatGPT | - | 80.5 | 57.3 | 56.6 | 65.1 | 56.5 | 52.7 |
| BiEnc | BL | ChatGPT+CN | - | 81.7 | 62.1 | 59.5 | 67.8 | 59.6 | 53.1 |
| BiEnc | BB | ChatGPT+CN | - | 76.2 | 60.6 | 58.4 | 66.9 | 56.6 | 51.8 |
| BiEnc | RB | ChatGPT+CN | - | 75.8 | 60.1 | 58.2 | 66.1 | 56.3 | 51.7 |
| BiEnc | RL | ChatGPT+CN | - | 80.8 | 61.7 | 59.3 | 67.2 | 58.8 | 52.7 |
| BiEnc-F | BL | ChatGPT+CN | CN | 84.3 | 63.5 | 57.7 | 69.4 | 61.0 | 59.9 |
| BiEnc-F | BL | ChatGPT+CN | ChatGPT | 84.3 | 64.9 | 65.5 | 69.5 | 61.6 | 61.9 |
| BiEnc-F | BL | ChatGPT+CN | ChatGPT+CN | 86.2 | 65.9 | 67.0 | 70.3 | 63.6 | 63.0 |
| BiEnc-F | BB | ChatGPT+CN | ChatGPT+CN | 82.1 | 63.0 | 61.2 | 65.3 | 60.2 | 59.9 |
| BiEnc-F | RB | ChatGPT+CN | ChatGPT+CN | 81.5 | 62.3 | 60.8 | 65.0 | 59.6 | 61.3 |
| BiEnc-F | RL | ChatGPT+CN | ChatGPT+CN | 85.6 | 65.1 | 65.9 | 69.2 | 63.1 | 62.8 |

4.1 Predicting Commonsense Properties

The use of facets should lead to concept embeddings that capture a wider range of properties. To test this hypothesis, we consider the task of commonsense property prediction, which we treat as a binary classification problem: given a concept and a commonsense property, decide whether the property is satisfied by the concept or not. The difficulty of this task depends on how the training-test split is constructed. One strategy, called concept split, ensures that the concepts appearing in the training and test data are different, but the properties are not. Gajbhiye et al. (2022) found that simple nearest neighbour strategies can do well on this task, meaning that this variant does not adequately assess whether the concept embeddings capture commonsense knowledge. For this reason, they proposed a property split, where the properties appearing in training and test are different, but the concepts are the same. Finally, they also considered a C+P (concept+property) split, where both the concepts and properties are different in training and test. In all cases, we first pre-train the encoders $\mathsf{Con}$, $\mathsf{Prop}$ and $\mathsf{Facet}$ on the ChatGPT and/or ConceptNet training data (see Section 3.1), before fine-tuning on the training splits of the property prediction benchmarks. Detailed training settings can be found in the appendix.

Table 1 summarises the results for all three settings, using two standard benchmarks for commonsense property prediction: the extension of the McRae property norms dataset (McRae et al., 2005) that was introduced by Forbes et al. (2019) and the augmented version of CSLB (https://cslb.psychol.cam.ac.uk/propnorms) introduced by Misra et al. (2022). Our main baseline is the bi-encoder model from Gajbhiye et al. (2022), shown as BiEnc, which also forms the basis for our facet-based model. We report the results from Gajbhiye et al. (2022), which are for a model that was pre-trained on Microsoft Concept Graph (Ji et al., 2019) and GenericsKB (Bhakthavatsalam et al., 2020), as well as results for models that we trained on the ChatGPT and ConceptNet training sets (see Section 3.1). Finally, we show the result of our full model (i.e. with the facet encoder), shown as BiEnc-F. We compare four different encoders: BERT-base, BERT-large, RoBERTa-base and RoBERTa-large.

The results clearly show the benefit of using facets, as the BiEnc-F models consistently and substantially outperform the BiEnc baselines. Comparing the different training sets, the ChatGPT training examples are more effective than the ConceptNet examples, but the best results are obtained when both sources are combined. Among the different LMs, BERT-large achieves the best results.

Based on the results from this experiment, for the remaining experiments, we will focus on the model based on BERT-large, which is trained on both the ChatGPT and ConceptNet training examples.

Table 2: Outlier detection results (percentage of instances for which all three positive examples were correctly identified).

| Property | ConEmb | ConEmb-F | MultiConEmb |
| Dangerous | 9 | 3 | 13 |
| Edible | 23 | 26 | 67 |
| Flies | 23 | 34 | 77 |
| Hot | 7 | 11 | 40 |
| Lives in water | 33 | 48 | 83 |
| Produces noise | 13 | 13 | 48 |
| Sharp | 33 | 30 | 67 |
| Used by children | 8 | 9 | 17 |
| Used for cooking | 59 | 53 | 93 |
| Worn on feet | 18 | 13 | 54 |

4.2 Outlier Detection

To evaluate whether our facet-based concept embeddings can help us to identify commonalities, we consider the task of outlier detection (Camacho-Collados and Navigli, 2016; Blair et al., 2017; Brink Andersen et al., 2020). In each instance of this task, we are given a set of concepts (or entities). Some of these concepts have a particular property in common, and the task consists in identifying these concepts (without being given any information about the shared property itself). This task has traditionally been used as an intrinsic benchmark for evaluating word embeddings.

Dataset

Existing benchmarks mostly focus on broad taxonomic categories, whereas we are specifically interested in identifying shared commonsense properties. We therefore constructed a new outlier detection benchmark based on the extended McRae dataset from Forbes et al. (2019). To create an outlier detection problem instance, we first select a property from this dataset (e.g. dangerous) as well as 3 concepts which have this property and 7 outlier concepts which do not. Many of the properties in the McRae dataset are not suitable for our benchmark, either because they correspond to taxonomic categories (e.g. an animal or edible) or because they are too ambiguous or noisy (e.g. accordion, car and escalator are described as having the property fun, but airplane is not). Therefore, we manually selected 10 properties which do not suffer from these limitations. For each property, we manually clustered the concepts with this property into broad taxonomic groups; the resulting clusters can be found in the appendix. When selecting the 3 positive concepts for a given instance, we ensure that all three examples come from a different group. When selecting the 7 outliers, we check that they do not share any properties. Specifically, we ensure that any two of the outliers do not have any properties in common in the extended McRae dataset, in ConceptNet or in Ascent++ (https://ascentpp.mpi-inf.mpg.de). For each property, we sample 100 problem instances following this process. We report the results in terms of exact match, i.e. the prediction for a given instance is labelled as correct if the three positive examples were correctly identified. We report the percentage of correctly labelled instances for each property.
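
The sampling procedure can be summarised as in the sketch below. The data structures (`groups`, `candidates` and the `shared_properties` check against McRae, ConceptNet and Ascent++) are hypothetical placeholders for the curated data.

```python
# Illustrative sketch of how one outlier detection instance is sampled.
import random

def sample_instance(groups, candidates, shared_properties):
    """groups: taxonomic groups of concepts that have the chosen property;
    candidates: concepts that do not have the property;
    shared_properties(a, b): True if a and b share any known property."""
    positives = [random.choice(g) for g in random.sample(groups, 3)]  # 3 distinct groups
    outliers = []
    while len(outliers) < 7:
        c = random.choice(candidates)
        if c not in outliers and all(not shared_properties(c, o) for o in outliers):
            outliers.append(c)
    return positives, outliers
```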

Methods

We compare three strategies for detecting outliers. For the method denoted ConEmb, we use the ConEmb embeddings as follows. For each concept $c$, we find its second and third nearest neighbours, which we denote as $c_2$ and $c_3$. If $c$ is a positive example, $\cos(\mathsf{Con}(c),\mathsf{Con}(c_2))$ should be high and $\cos(\mathsf{Con}(c),\mathsf{Con}(c_3))$ should be low. We thus score each concept as $\textit{score}(c)=\cos(\mathsf{Con}(c),\mathsf{Con}(c_2))-\cos(\mathsf{Con}(c),\mathsf{Con}(c_3))$. As positive examples, we then select the concept with the highest score along with its two nearest neighbours. The method denoted ConEmb-F uses the same strategy, but instead uses the ConEmb-F embeddings. Finally, when using the method denoted MultiConEmb, we first obtain 10 facet-specific embeddings of each concept, using (1). We then apply the same method as before to each of the 10 facet-specific embedding spaces. Finally, we select the prediction for the facet where the score of the highest-scoring concept was maximal.
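
The scoring rule is summarised in the following sketch, which assumes a matrix of L2-normalised embeddings for the 10 concepts of a single problem instance; for MultiConEmb, the same function is applied to each facet-specific space and the prediction of the facet with the highest maximum score is kept.

```python
# Illustrative sketch of the outlier detection scoring rule for one problem instance.
import numpy as np

def predict_positives(emb):
    """emb: (10, d) L2-normalised embeddings of the 10 concepts in one instance."""
    sims = emb @ emb.T                          # cosine similarities
    np.fill_diagonal(sims, -np.inf)
    order = np.argsort(-sims, axis=1)           # neighbours sorted by similarity
    idx = np.arange(len(emb))
    scores = sims[idx, order[:, 1]] - sims[idx, order[:, 2]]   # cos(c, c2) - cos(c, c3)
    best = int(np.argmax(scores))
    predicted = {best, int(order[best, 0]), int(order[best, 1])}  # best concept + its two NNs
    return predicted, float(scores[best])       # the score is used by MultiConEmb to pick a facet
```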

Results

The results are summarised in Table 2. MultiConEmb, which exploits facet-specific representations, substantially outperforms the baselines. The performance of ConEmb and ConEmb-F is comparable, which is as expected: even though ConEmb-F was trained using facets, this method represents concepts as single vectors, and the similarities between these concept vectors still mostly reflect taxonomic relatedness.

Table 3: Ontology completion results.

| Method | Wine | Econ | Olym | Tran | SUMO |
| GloVe | 14.2 | 14.1 | 9.9 | 8.3 | 34.9 |
| Skipgram | 13.8 | 13.5 | 8.3 | 7.2 | 33.4 |
| Numberbatch | 25.6 | 26.2 | 26.8 | 16.0 | 47.3 |
| MirrorBERT | 22.5 | 23.8 | 20.9 | 12.7 | 40.1 |
| MirrorWiC | 24.7 | 24.9 | 22.1 | 13.9 | 46.9 |
| ConCN | 31.3 | 32.4 | 29.7 | 20.9 | 52.6 |
| ConEmb | 30.8 | 30.5 | 28.6 | 19.8 | 51.3 |
| ConEmb-F | 31.2 | 31.8 | 30.4 | 20.9 | 51.7 |
| Clu (ConCN) | 35.3 | 33.1 | 32.5 | 21.6 | 52.2 |
| Clu (ConEmb-F) | 36.9 | 34.2 | 34.6 | 22.1 | 53.3 |
| MClu (ConCN) | 39.8 | 35.9 | 32.6 | 22.7 | 54.2 |
| MClu (ConEmb-F) | 39.9 | 36.3 | 32.9 | 23.1 | 55.4 |

4.3 Ontology Completion

Ontologies use rules to encode how the concepts of a given domain are related. They generalise taxonomies by allowing the use of logical connectives to encode these relationships. Li et al. (2019) introduced a framework for predicting missing rules in ontologies using a Graph Neural Network. The nodes of the considered graph correspond to the concepts from the ontology, and the input representations are pre-trained concept embeddings. Recent work has used this model to evaluate concept embeddings, as its overall performance is sensitive to the quality of the input representations (Li et al., 2023b). The intuition underpinning the model is closely aligned with the idea of modelling concept commonalities. Essentially, if the ontology contains the rules $X_1\sqsubseteq Y,...,X_k\sqsubseteq Y$ (in description logic syntax, $X\sqsubseteq Y$ means that every instance of the concept $X$ is also an instance of the concept $Y$, i.e. it represents the rule "if $X$ then $Y$"), and we know from the pre-trained embeddings that $X_{k+1}$ is similar to $X_1,...,X_k$, then it is plausible that the rule $X_{k+1}\sqsubseteq Y$ is valid as well.

We test the effectiveness of our model in two ways. First, we use the ConEmb-F concept embeddings as input features, which allows for a direct comparison with the effectiveness of other concept embedding models. In this case, the use of facets only affects how the concept embeddings are learned. As a second strategy, referred to as MClu, we first obtain 10 facet-specific embeddings of all the concepts, using (1). We then cluster the concepts in each of the facet-specific embedding spaces separately, relying on affinity propagation for this step. This results in 10 different clusterings of the concepts. For each cluster $\mathcal{C}$ in each of these clusterings, we add a fresh concept $Y_{\mathcal{C}}$ to the ontology, and for every concept $X$ in $\mathcal{C}$, we add the rule $X\sqsubseteq Y_{\mathcal{C}}$. We then apply the standard GNN model from Li et al. (2019) to the resulting extended ontology. As a baseline, we also apply the clustering strategy to the ConEmb concept embeddings, which we refer to as Clu. Note that in this case there is only one clustering. When MClu or Clu is used, we still need concept embeddings as input features. We show results with the ConCN embeddings from Li et al. (2023b) and with our ConEmb-F embeddings.
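
A sketch of the MClu ontology-extension step is shown below. The `ontology` interface is hypothetical, and the mapping of the tuned preference values onto scikit-learn's `preference` parameter is an assumption.

```python
# Illustrative sketch of the MClu extension step: cluster the concepts in each
# facet-specific embedding space and add a fresh concept Y_C plus rules X ⊑ Y_C.
from sklearn.cluster import AffinityPropagation

def extend_ontology(ontology, concepts, facet_embeddings, preference=None):
    """facet_embeddings: list of k arrays of shape (n_concepts, d), one per facet (Eq. 1)."""
    for j, emb in enumerate(facet_embeddings):
        labels = AffinityPropagation(preference=preference).fit_predict(emb)
        for cluster_id in set(labels):
            fresh = f"Y_facet{j}_cluster{cluster_id}"    # fresh concept Y_C (name is illustrative)
            ontology.add_concept(fresh)
            for concept, lab in zip(concepts, labels):
                if lab == cluster_id:
                    ontology.add_rule(concept, fresh)    # rule X ⊑ Y_C
    return ontology
```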

The results in Table 3 show that the ConEmb-F input embeddings consistently outperform the ConEmb vectors. Moreover, they achieve performance similar to that of the ConCN embeddings, which represent the current state of the art. The proposed clustering strategies, which have not previously been considered for ontology completion, are also highly effective. The MClu variant outperforms Clu in all but one case, which shows the benefit of explicitly considering facet-specific embeddings. We can also see that, when used in combination with the clustering strategies, the ConEmb-F embeddings perform better as input features than the ConCN embeddings.

Table 4: Ultra-fine entity typing results (F1).

| Method | F1 |
| Base model | 49.2 |
| Properties | 50.9 |
| Clu (ConCN) | 50.4 |
| Clu (ConEmb) | 50.6 |
| Clu (ConEmb-F) | 50.8 |
| Clu (ConCN) + properties | 50.9 |
| Clu (ConEmb) + properties | 51.1 |
| MClu | 51.3 |

4.4 Ultra-Fine Entity Typing

We consider the task of ultra-fine entity typing (Choi et al., 2018), which was also used by Gajbhiye et al. (2023) to demonstrate the usefulness of modelling concept commonalities. Given a sentence in which an entity mention is highlighted, the task consists in assigning labels that describe the semantic type of the entity. The task is formulated as a multi-label classification problem with around 10K candidate labels. Many of the candidate labels only have a small number of occurrences in the training data. This makes it paramount to rely on some kind of pre-trained knowledge about the meaning of the labels. Li et al. (2023a) proposed a simple but surprisingly effective strategy: use pre-trained concept embeddings to cluster the labels and augment the training set with labels that refer to these clusters. For instance, if a training example is labelled with label $l$ and this label belongs to cluster $c$, then they add the synthetic label "cluster $c$" to this training example. This intuitively teaches the model which labels are semantically related, as the training objective encourages instances which are labelled with "cluster $c$" to be linearly separated from other instances. Gajbhiye et al. (2023) improved on this strategy by instead identifying commonsense properties that were satisfied by the different concepts/labels, and by augmenting the training examples with these properties, instead of synthetic cluster labels. This use of shared properties has the advantage that a broader range of commonalities can be identified, whereas clustering standard concept embeddings leads to clusters that only reflect standard taxonomic categories. Our hypothesis is that we can achieve the same benefits by clustering our multi-facet representations, and that the use of clusters can potentially lead us to capture finer-grained commonalities.
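
The label-augmentation strategies compared below can be sketched as follows. The data structures are illustrative, and for the Clu baseline a single embedding space would be passed instead of the list of facet-specific spaces.

```python
# Illustrative sketch of Clu / MClu label augmentation for ultra-fine entity typing.
from sklearn.cluster import AffinityPropagation

def augment_labels(examples, label_list, embeddings_per_facet):
    """examples: list of (mention, set_of_gold_labels);
    embeddings_per_facet: list of (n_labels, d) arrays (a single array for Clu)."""
    synthetic = {label: set() for label in label_list}
    for j, emb in enumerate(embeddings_per_facet):
        clusters = AffinityPropagation().fit_predict(emb)
        for label, cl in zip(label_list, clusters):
            synthetic[label].add(f"facet{j}_cluster{cl}")   # synthetic "cluster" label
    augmented = []
    for mention, gold in examples:
        extra = set().union(*(synthetic[l] for l in gold)) if gold else set()
        augmented.append((mention, gold | extra))           # add cluster labels to the gold set
    return augmented
```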

Table 4 summarises the results. All results were obtained using the DenoiseFET model from Pan et al. (2022). The base model in Table 4 shows the results if we use this model without augmenting the training labels. Properties refers to the strategy from Gajbhiye et al. (2023), which adds labels corresponding to shared properties, while Clu is the strategy from Li et al. (2023a), which adds labels corresponding to clusters. We show results for this clustering strategy with three different concept embeddings: the ConCN embeddings from Li et al. (2023b) as well as ConEmb and ConEmb-F. The approach where we use clusterings from different facet-specific embeddings is shown as MClu. We can see that MClu achieves the best results, which confirms the usefulness of facet-specific representations for this task. When using the Clu strategy, ConEmb-F also slightly outperforms ConEmb.

5 Conclusions

Many applications rely on background knowledge about the meaning of concepts. What is needed often boils down to knowledge about the commonalities between different concepts, as this forms the basis for inductive generalisation. Clustering pre-trained concept embeddings has been proposed in previous work as a viable strategy for modelling such commonalities. However, the resulting clusters primarily capture taxonomic categories, while commonalities that depend on various commonsense properties are essentially ignored. In this paper, we proposed a simple strategy for obtaining more diverse representations, by taking into account different facets of meaning when training the concept encoder. We found that the resulting concept representations lead to consistently better results, across all the considered tasks.

Acknowledgments

This work was supported by EPSRC grants EP/V025961/1 and EP/W003309/1, ANR-22-CE23-0002 ERIANA and HPC resources from GENCI-IDRIS (Grant 2023-[AD011013338R1]).

Limitations

Our approach relies on encoders from the BERT family, which are much smaller than recent language models. We did some initial experiments with Llama 2, but were not successful in obtaining better-performing concept embeddings with this model. While it seems likely that future work will reveal more effective strategies for using larger models, our use of BERT still has the advantage that we can efficiently encode a large number of labels, which remains important for applications such as extreme multi-label text classification.

We have only looked at modelling commonsense properties of concepts. Modelling facets of meaning is intuitively also important for modelling named entities (e.g. for entity linking) and for sentence/document embeddings (e.g. for retrieval). An analysis of facet-based models for such applications is left as a topic for future work.

References

Appendix A Training Details

Predicting Commonsense Properties

We use the AdamW optimizer with learning rate 2e-5. We use early stopping with a patience of 20. We use the same settings as Gajbhiye et al. (2022) for both fine-tuning and pre-training. During pre-training, we randomly select 10,000 pairs for tuning. For the concept split, we use the fixed training-test split from Gajbhiye et al. (2022). For the property split, we use 5-fold cross-validation, while for C+P, we use the 3 × 3 fold cross-validation strategy. During fine-tuning, for all splits, we randomly select 20% of the training set as validation data for model selection. We use a batch size of 32 and set the maximal sequence length for the concept and property prompts to 32.

Outlier Detection

For these experiments, we use the same bi-encoder models as for predicting commonsense properties. Specifically, we have used the model initialised from BERT-large-uncased, which was trained using ConceptNet and ChatGPT. However, in this case, the model is used without further fine-tuning.

Ontology Completion

To get concept clusters from the affinity propagation algorithm, we tune the preference values from $\{0.5, 0.6, 0.7, 0.8, 0.9\}$. We set the learning rate to 1e-2, the maximum number of epochs to 200 and the dropout rate to 0.5. We use AdamW as optimizer and tune the number of hidden dimensions from $\{8, 16, 32, 64\}$ and the number of GNN layers from $\{3, 4, 5\}$. We use a weight decay of 5e-2.

Ultra-Fine Entity Typing

To get concept clusters, we again use affinity propagation and select the preference values from $\{0.5, 0.6, 0.7, 0.8, 0.9\}$. We use a learning rate of 2e-5 with AdamW as optimizer, a batch size of 16 and a maximum number of 500 epochs.

Table 5: Properties selected from the McRae dataset for the outlier detection benchmark, with the positive examples organised into broad taxonomic groups.

Dangerous
1: alligator, bear, beehive, bull, cheetah, crocodile, lion, rattlesnake, tiger
2: bazooka, bomb, bullet, crossbow, dagger, grenade, gun, harpoon, missile, pistol, revolver, rifle, rocket, shotgun, sword, axe, baseball bat, knife, machete
3: motorcycle

Edible
1: apple, banana, cranberry, tomato, tangerine, strawberry, spinach, rhubarb, raspberry, radish, pumpkin, prune, mushroom, parsley, walnut, rice, raisin, potato, plum, pineapple, pepper, peas, pear, peach, orange, onions, olive, nectarine, lime, lettuce, lemon, grapefruit, grape, garlic, cucumber, corn, coconut, cherry, celery, cauliflower, carrot, cantaloupe, cabbage, broccoli, blueberry, beets, beans, avocado, asparagus
2: bread, cake, cheese, hot dog, pizza, sandwich, pie, donut, biscuit
3: crab, deer, hare, octopus, salmon, turkey, tuna, trout, squid, shrimp, sardine, pig, octopus, lobster, lamb, goat, cow, clam, chicken

Flies
1: bird, crow, dove, duck, eagle, falcon, flamingo, goose, hawk, woodpecker, pigeon, owl, peacock, seagull
2: butterfly, hornet, housefly, wasp, moth
3: airplane, helicopter, jet, missile, rocket
4: balloon, kite, frisbee

Hot
1: bathtub
2: cigar, cigarette, candle
3: hair drier, kettle, oven, stove, toaster

Lives in water
1: alligator, crocodile, otter, turtle, seal, frog, flamingo, salamander, swan, walrus
2: clam, crab, lobster, octopus, squid, shrimp
3: dolphin, eel, goldfish, whale, salmon, sardine, trout, tuna
4: boat, sailboat, ship, submarine, yacht, canoe

Produces noise
1: accordion, bagpipe, clarinet, flute, harp, piano, trombone, violin
2: airplane, ambulance, helicopter
3: bomb, cannon, grenade
4: cell phone, hair drier, stereo

Sharp
1: axe, bayonet, dagger, machete, knife, spear, sword
2: chisel, corkscrew, screwdriver, scissors
3: blender, grater
4: razor
5: pin

Used by children
1: balloon
2: buggy
3: crayon, paintbrush, pencil
4: doll, teddy bear, toy
5: earmuffs
6: frisbee, kite
7: skateboard, sled, tricycle

Used for cooking
1: apron
2: blender, grater, mixer
3: bowl, colander, pan, pot, skillet, strainer
4: kettle, microwave, oven, stove, toaster
5: knife, ladle, spatula, spoon, tongs

Worn on feet
1: boots, sandals, shoes, slippers
2: nylons, socks
3: skis, snowboard

Appendix B Outlier Detection Dataset

Table 5 shows the properties that were selected from the McRae dataset for constructing the outlier detection benchmark. For each property, we selected concepts that are asserted to have this property in the dataset, and we organised them into broad taxonomic categories, which are also shown in Table 5.

Appendix C Qualitative Analysis

Table 6 shows examples of the nearest neighbours of frisbee and bureau in some of the facet-specific embedding spaces that are used by the MClu strategy. For this analysis, we use the set of concepts from the McRae dataset. These examples illustrate how different facets emphasise different aspects.

For the case of frisbee, the first facet links this concept to other sports related terms. In the second facet, it is instead related to round things. In the third facet, the top neighbours are related to kids. Due to the way in which the facets are learned (e.g. by considering a fixed number of facet embedding clusters), there are also some facets that reflect a mixture of different aspects. For instance, the last facet in Table 6 combines elements of the first three facets, i.e. the nearest neighbours cover sports related concepts, kids related concepts, and round things. For bureau, in the first facet we find office related terms. In the second facet, the top neighbours are different types of furniture. In facets 3 and 4 we see a mixture of different kinds of terms.

Table 7 similarly shows examples of nearest neighbours in different facets for the label space from the UFET dataset.

Table 6: Nearest neighbours of frisbee and bureau in different facet-specific embedding spaces (McRae concepts).

frisbee
Facet 1: tricycle, surfboard, sports_ball, tennis_racket, snowboard, kite, doll, balloon, toy, pie
Facet 2: balloon, pie, sports_ball, cake, donut, teddy_bear, surfboard, kite, doll, toy
Facet 3: kite, doll, balloon, toy, tricycle, moth, pie, sports_ball, surfboard, football
Facet 4: tricycle, surfboard, sports_ball, tennis_racket, snowboard, kite, doll, balloon, toy, pie

bureau
Facet 1: envelope, certificate, typewriter, doorknob, fence, cabinet, carpet, shelves, bookcase, gopher
Facet 2: desk, dining_table, table, shelves, bookcase, envelope, typewriter, gopher, escalator, peg
Facet 3: shelves, bookcase, envelope, desk, peg, gopher, tack, cabinet, handbag, hook
Facet 4: bookcase, cabinet, shelves, desk, doorknob, dining_table, envelope, typewriter, handbag, lamp

Table 7: Nearest neighbours of keyboard and business card in different facet-specific embedding spaces (UFET label space).

keyboard
Facet 1: bass, rhythm guitar, lead guitar, drum major, pianist, bass drum, bass guitar, bassist, bass guitarist, air guitar
Facet 2: processor, organ, desktop, mac, storage, file system, thumb drive, file extension, touch screen, desktop environment

business card
Facet 1: flash card, smart card, bank card, graphics card, green card, index card, network card, phone card, press card, punch card, report card, trade card
Facet 2: business class, business cycle, business day, business economics, business end, business ethics, business intelligence, business logic