Updating Pre-trained Word Vectors and Text Classifiers using Monolingual Alignment
The authors drew inspiration from the way #multilingual word vectors are learned. They treated general-purpose and domain-specific corpora as separate languages and used a word-embedding model to learn independent vectors from each. Then they aligned the vectors from one corpus with those from the other.
To align word vectors from two corpora, the words the corpora share are used to find a single, consistent representation for all words. For example, if one corpus is [human, cat] and the other is [cat, dog], the model applies a transformation that brings the two cat vectors together while retaining the relative positions of the word vectors for cats, dogs, and humans.
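Roughly, the idea can be sketched with toy 2-D vectors and a plain orthogonal Procrustes rotation (the paper's actual criterion, RCSLS, is described below); the numbers and names here are made up for illustration:

```python
import numpy as np

# Toy 2-D vectors: "cat" appears in both corpora, so it anchors the alignment.
general = {"human": np.array([1.0, 0.2]), "cat": np.array([0.3, 0.9])}
domain  = {"cat":   np.array([0.9, 0.3]), "dog": np.array([0.8, -0.1])}

# Stack the vectors of the shared vocabulary (here just "cat").
shared = [w for w in general if w in domain]
X = np.stack([domain[w]  for w in shared])   # source: domain space
Y = np.stack([general[w] for w in shared])   # target: general space

# Orthogonal Procrustes: the rotation that best maps domain vectors onto
# their general-purpose counterparts.
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

# Apply the same rotation to every domain word: relative positions between
# cat and dog are preserved, but "cat" now lands near its general vector.
aligned_domain = {w: W @ v for w, v in domain.items()}
```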
A word-embedding model learns independent word vectors from both corpora.
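For that step, something along these lines would do, assuming two tokenized corpora and using gensim's Word2Vec purely as an illustrative choice of embedding model:

```python
from gensim.models import Word2Vec

# Hypothetical tiny tokenized corpora; in practice these are large.
general_corpus = [["the", "cat", "sat", "near", "the", "human"],
                  ["the", "human", "fed", "the", "cat"]]
domain_corpus  = [["the", "dog", "barked", "at", "the", "cat"],
                  ["the", "cat", "ignored", "the", "dog"]]

# Two independent embedding spaces, one per corpus.
general_model = Word2Vec(general_corpus, vector_size=50, min_count=1, sg=1, epochs=50)
domain_model  = Word2Vec(domain_corpus,  vector_size=50, min_count=1, sg=1, epochs=50)

general_vecs = {w: general_model.wv[w] for w in general_model.wv.index_to_key}
domain_vecs  = {w: domain_model.wv[w]  for w in domain_model.wv.index_to_key}
```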
The authors use a loss function called #RCSLS for training. RCSLS balances two objectives: general-purpose vectors that are close together remain close together, while general-purpose vectors that are far apart remain far apart. After alignment, words common to the two corpora have two vectors, one from each space; averaging them produces a single vector representation.
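A rough numpy sketch of the RCSLS (relaxed CSLS) criterion and the final averaging step, assuming unit-normalized vectors and glossing over how the map W is optimized; all names are illustrative, not from the paper's code:

```python
import numpy as np

def rcsls_loss(X, Y, W, src_full, tgt_full, k=10):
    """Relaxed CSLS criterion for aligned pairs (x_i, y_i).

    X, Y     : (n, d) vectors of the shared words in source / target space
    W        : (d, d) linear map being learned (source -> target)
    src_full : (V_s, d) all source vectors, used for nearest neighbours
    tgt_full : (V_t, d) all target vectors, used for nearest neighbours
    """
    XW = X @ W.T                              # mapped source vectors
    pair_sim = np.sum(XW * Y, axis=1)         # keep true pairs close
    # Penalty: mean similarity to the k nearest target neighbours of each
    # mapped source vector ...
    nn_tgt = np.sort(XW @ tgt_full.T, axis=1)[:, -k:].mean(axis=1)
    # ... and to the k nearest mapped source neighbours of each target vector.
    nn_src = np.sort(Y @ (src_full @ W.T).T, axis=1)[:, -k:].mean(axis=1)
    return np.mean(-2.0 * pair_sim + nn_tgt + nn_src)

def merge_common(general_vecs, aligned_domain_vecs):
    """Average the two vectors of each shared word into one representation."""
    return {w: (general_vecs[w] + aligned_domain_vecs[w]) / 2.0
            for w in general_vecs.keys() & aligned_domain_vecs.keys()}
```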
They consider applications to both word-embedding and text-classification models, and show that the proposed approach performs well in all setups, outperforming a baseline that simply fine-tunes the model on the new data.
paper: https://arxiv.org/abs/1910.06241
#nlp