Machinelearning
nbdev: use Jupyter Notebooks for everything https://www.fast.ai//2019/12/02/nbdev/ github: https://github.com/fastai/nbdev/
This is a forward from an independent channel run by our fellow engineers.
@ai_machinelearning_big_data
FreeLB: Enhanced Adversarial Training for Language Understanding
The authors propose a novel adversarial training algorithm, FreeLB, which promotes higher robustness and invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resulting adversarial risk inside different regions around the input samples. It is applied to Transformer-based models for NLU & commonsense reasoning tasks.
Experiments on the GLUE benchmark show that, when applied only to the finetuning stage, it improves the overall test scores:
* BERT-base: 78.3 -> 79.4
* RoBERTa-large: 88.5 -> 88.8
The proposed approach also achieves SOTA single-model test accuracies of 85.44% and 67.75% on ARC-Easy and ARC-Challenge, respectively.
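Roughly, one training step looks like the following PyTorch sketch. It is a simplification of the paper's algorithm, not the authors' code: the HuggingFace-style inputs_embeds/labels interface, the hyper-parameter values, and the L-infinity clipping (instead of the paper's per-example L2 projection) are all my assumptions.

import torch

def freelb_step(model, batch, optimizer, adv_lr=1e-1, adv_eps=1e-1, adv_steps=3):
    # K ascent steps on an embedding perturbation, accumulating parameter
    # gradients along the way, then one descent step on their average.
    emb_layer = model.get_input_embeddings()
    with torch.no_grad():
        delta = torch.zeros_like(emb_layer(batch["input_ids"])).uniform_(-adv_eps, adv_eps)
    delta.requires_grad_(True)

    optimizer.zero_grad()
    for _ in range(adv_steps):
        embeds = emb_layer(batch["input_ids"])            # recomputed so the embedding layer also gets gradients
        out = model(inputs_embeds=embeds + delta,
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        (out.loss / adv_steps).backward()                 # accumulates grads for the params AND for delta

        g = delta.grad.detach()
        delta = delta.detach() + adv_lr * g / (g.norm() + 1e-12)     # ascent step on the perturbation
        delta = delta.clamp(-adv_eps, adv_eps).requires_grad_(True)  # keep it inside the eps-ball
    optimizer.step()                                      # descend on the accumulated gradient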
paper: https://arxiv.org/abs/1909.11764
#nlp #nlu #bert #adversarial #ICLR
With the new TensorBoard.dev you can upload and share your DL/ML experiment results in a hosted TensorBoard.
Link: https://blog.tensorflow.org/2019/12/introducing-tensorboarddev-new-way-to.html
#DL #ML #tensorflow #tf
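At launch, uploading an experiment was a one-liner from the same package, something like: tensorboard dev upload --logdir ./runs (flag names as of the announcement; check the post for the current syntax). It returns a public tensorboard.dev link you can share.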
How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?
Chen et al., 2019 UCLA
arxiv.org/abs/1911.12360
The theory of deep learning is a new and fast-developing field. Recent studies suggest that huge over-parametrization of neural networks is not a bug but a feature: it is what allows deep NNs both to generalize and to be optimized with simple first-order gradient methods.
Chen et al. take another step toward solving the mysteries of deep learning; their main results are:
1. Sharp optimization and generalization guarantees for deep ReLU networks
2. Better asymptotics that allow applying the theory to smaller networks (polylogarithmic instead of polynomial hidden size; see the note below)
As the authors say, "Our results push the study of over-parameterized deep neural networks towards more practical settings."
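In other words (my paraphrase, not the paper's exact statement): earlier over-parameterization guarantees require a hidden width m on the order of poly(n) in the number of training samples n, whereas this analysis brings the requirement down to roughly polylog(n).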
For a deep dive into the theory of deep learning, I suggest:
iPavlov: github.com/deepmipt/tdl (Russian and English)
Stanford: stats385.github.io (English)
Forwarded from ML Trainings
Artur Kuzin talks (in English) about his participation in Kaggle Open Images 2019, where he won a gold medal in each of the three competitions.
In this video you will find out:
🔹 Description of the dataset and its annotation procedure, as well as the metric and its peculiarities
🔹 Overview of the architectures of the best models
🔹 Overview of tricks and hacks from the top 3 of each competition
🔹 An approach for quick model training
https://youtu.be/NGnOY-AzDBg
🇳🇱 We received a request to set up weekly data breakfasts in Eindhoven (the Netherlands).
Please PM @malev if you are a resident or live nearby and are up for joining the weekly breakfasts.
Great article on Pandas optimization
Link: https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2
#pandas #optimization
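For a flavour of what the article covers, here is a minimal sketch of the usual tricks (the toy DataFrame is made up; your numbers will differ):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": np.arange(1_000_000),
    "country": np.random.choice(["NL", "FR", "RU"], size=1_000_000),
    "amount": np.random.rand(1_000_000),
})

df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")  # int64 -> uint32
df["amount"] = pd.to_numeric(df["amount"], downcast="float")       # float64 -> float32
df["country"] = df["country"].astype("category")                   # strings -> integer codes

print(df.memory_usage(deep=True))   # compare before/after to see the savings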
Deep Learning for Symbolic Mathematics
Abstract: Neural networks have a reputation for being better at solving statistical or approximate problems than at performing calculations or working with symbolic data. In this paper, we show that they can be surprisingly good at more elaborated tasks in mathematics, such as symbolic integration and solving differential equations. We propose a syntax for representing mathematical problems and methods for generating large datasets that can be used to train sequence-to-sequence models. We achieve results that outperform commercial Computer Algebra Systems such as Matlab or Mathematica.
They show that transformers cope surprisingly well with complex mathematical problems such as function integration and differential equations.
They adapt seq2seq models to these mathematical problems.
They use a standard transformer (8 heads, 6 layers) trained with Adam; the main contribution on the data side is a framework for generating valid math problems to build the dataset. Interestingly, for a given problem, beam search often assigns high scores to several hypotheses that are all correct and are simply equivalent forms of one another.
For integration, the model achieves close to 100% performance on a held-out test set, even with greedy decoding (beam size 1). This performance is consistent over the three integration datasets (FWD, BWD, and IBP). Greedy decoding (beam size 1) does not work as well for differential equations. In particular, we observe an improvement in accuracy of almost 40% when using a large beam size of 50 for second-order differential equations. Unlike in machine translation, where increasing the beam size does not necessarily increase the performance (Ott et al., 2018), we always observe significant improvements with wider beams. Typically, using a beam size of 50 provides an improvement of 8% accuracy compared to a beam size of 10. This makes sense, as increasing the beam size will provide more hypotheses, although a wider beam may displace a valid hypothesis to consider invalid ones with better log-probabilities.
The model also generalizes to integration problems that SymPy cannot solve, even though it was trained only on problems that SymPy can.
In particular, we found that the accuracy of SymPy on the BWD test set is only 30%. Our FWD-trained model only obtains an accuracy of 17.2% on BWD. However, we observed that the FWD-trained model is sometimes able to compute the integral of functions that SymPy cannot compute. This means that by only training on functions that SymPy can integrate, the model was able to generalize to functions that SymPy cannot integrate. Table 7 presents examples of such functions with their integrals.
But the final pipeline is not purely end-to-end: each hypothesis still has to be checked with a symbolic framework.
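That check is cheap, though: a candidate integral is accepted iff differentiating it gives back the integrand. A tiny SymPy sketch (the function here is my own example, not one from the paper):

import sympy as sp

x = sp.symbols("x")
f = x * sp.cos(x)                       # integrand
candidate = x * sp.sin(x) + sp.cos(x)   # e.g. one of the model's beam hypotheses

# accept the hypothesis iff its derivative simplifies back to f
assert sp.simplify(sp.diff(candidate, x) - f) == 0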
paper: https://arxiv.org/abs/1912.01412
tweet: http://twitter.com/GuillaumeLample/status/1202178956063064064
ODS breakfast in Paris! 🇫🇷 See you this Saturday at 10:30 (many people come around 11:00, and this time, because of the strike, it may be even later) at Malongo Café, 50 Rue Saint-André des Arts. We are expecting from 4 to 10 people.
Yet another article about adversarial prints to fool face recognition
With progress on facial recognition, such prints will become more popular.
Link: https://medium.com/syncedreview/personal-invisibility-cloak-stymies-people-detectors-15bebdcc7943
#facerecognition #adversarial
LOGAN: Latent Optimisation for Generative Adversarial Networks
A game-theory-motivated algorithm from #DeepMind improves the state of the art in #GAN image generation by over 30%, as measured by FID.
ArXiV: https://arxiv.org/abs/1912.00953
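The core idea, in a minimal PyTorch sketch (the basic gradient variant only; the paper's full method adds a natural-gradient correction, and G, D, alpha here are placeholders):

import torch

def latent_step(G, D, z, alpha=0.9):
    # Move the latent towards samples the discriminator scores as more real;
    # z' is then used instead of z in both the generator and discriminator losses.
    z = z.clone().requires_grad_(True)
    score = D(G(z)).sum()
    (grad_z,) = torch.autograd.grad(score, z)
    return (z + alpha * grad_z).detach()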
🔹What's Hidden in a Randomly Weighted Neural Network?
Amazingly, this paper finds a subnetwork with random weights inside a Wide ResNet-50 that outperforms a ResNet-34 with trained weights on ImageNet!
In last year's ICLR paper on the Lottery Ticket Hypothesis, the authors showed that it is possible to take a large trained network and throw away about 95% of the weights so that the rest can be retrained to the same quality, starting from the same initialization.
The follow-up, Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, found that one can keep the weights at their initialization and learn only the mask, dropping unnecessary connections from the network - this gave around 40% accuracy on CIFAR while training not the weights of the model but only its structure. Similar observations were made for simple RL tasks, see Weight Agnostic Neural Networks.
However, it was not clear how well structure-only training works on realistic datasets and large networks, and without carefully chosen weights.
In this paper the authors run structure-only training on ImageNet for the first time. To do so:
- They take a big fat network (aka DenseNet); weights are initialized from a "binarized" Kaiming normal (either +std or -std instead of a normal draw).
- Each weight gets an additional scalar score s that reflects how important it is for a good prediction. At inference, the top-k% of weights by score are kept and the rest are zeroed out.
- With the weights fixed, only the scores are trained. The main trick: although the forward pass, like inference, uses only the top-k weights, in the backward pass the gradient flows through all the scores (see the sketch below). This mirrors LRD, where all weights are used in the forward pass and only a small subset in the backward pass.
Thus one can prune a random Wide ResNet-50 down to fewer active weights than a ResNet-34 and still get 73.3% accuracy on ImageNet. Magic.
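A minimal PyTorch sketch of that trick: the forward pass keeps the top-k scores, the backward pass is straight-through so every score gets a gradient. Layer choice, score initialization, and the abs() criterion are my simplifications, not the paper's exact setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, k):
        mask = torch.zeros_like(scores)
        idx = scores.flatten().topk(int(k * scores.numel())).indices
        mask.view(-1)[idx] = 1.0          # keep only the top-k% scores
        return mask

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None             # straight-through: the gradient reaches ALL scores

class MaskedLinear(nn.Linear):
    def __init__(self, in_f, out_f, k=0.3):
        super().__init__(in_f, out_f, bias=False)
        self.weight.requires_grad_(False)                        # random weights stay frozen
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))
        self.k = k

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.k)
        return F.linear(x, self.weight * mask)                   # only the scores are trained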
ArXiV: https://arxiv.org/pdf/1911.13299.pdf
YouTube explanation: https://www.youtube.com/watch?v=C6Tj8anJO-Q
via @JanRocketMan
#ImageNet #ResNet
Towards Lingua Franca Named Entity Recognition with BERT
The authors present a simple and effective recipe for building #multilingual #NER systems with #BERT.
Using a multilingual BERT framework, they train a single system that not only performs inference on English, German, Spanish, and Dutch, but also outperforms the same model trained on one language at a time, and can perform zero-shot inference as well.
The resulting model yields #SotA results on CoNLL Spanish and Dutch, and on OntoNotes Chinese and Arabic datasets.
Also, the English-trained model yields SotA zero-shot results for Spanish, Dutch, and German NER, with improvements ranging from 2.4 F1 to 17.8 F1.
Furthermore, the runtime signature (memory/CPU/GPU) of the model is the same as that of the models built for single languages, significantly simplifying its life-cycle maintenance.
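The paper does not ship code; a hedged sketch of the standard recipe with today's transformers API (label count, model choice, and the example sentence are mine):

import torch
from transformers import BertForTokenClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=9)   # e.g. a CoNLL-style BIO tag set

# Fine-tune on the concatenated per-language training sets, then run inference
# on any of them - or zero-shot on a language never seen during fine-tuning.
enc = tokenizer("Angela Merkel bezocht Amsterdam .", return_tensors="pt")
with torch.no_grad():
    tags = model(**enc).logits.argmax(-1)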
paper: https://arxiv.org/abs/1912.01389
Dream to Control: Learning Behaviors by Latent Imagination
Abstract: Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
Dreamer learns long-horizon behaviors from images purely by latent imagination. For this, it backpropagates value estimates through trajectories imagined in the compact latent space of a learned world model. Dreamer solves visual control tasks using substantially fewer episodes than strong model-free agents.
Dreamer learns a world model from past experiences that can predict the future. It then learns action and value models in its compact latent space. The value model optimizes Bellman's consistency of imagined trajectories. The action model maximizes value estimates by propagating their analytic gradients back through imagined trajectories. When interacting with the environment, it simply executes the action model.
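A heavily simplified sketch of that actor update: the world_model.imagine / actor / critic names are placeholders, and the real agent uses lambda-returns rather than a plain mean of predicted values.

import torch

def actor_update(world_model, actor, critic, actor_opt, start_state, horizon=15):
    # Roll out the learned latent dynamics (no environment interaction), score the
    # imagined states with the value model, and backprop that estimate into the action model.
    s = start_state
    values = []
    for _ in range(horizon):
        a = actor(s)                      # action model in latent space
        s = world_model.imagine(s, a)     # differentiable latent transition
        values.append(critic(s))          # value model on imagined states
    actor_loss = -torch.stack(values).mean()   # maximise the predicted value
    actor_opt.zero_grad()
    actor_loss.backward()                 # analytic gradients flow through the imagined trajectory
    actor_opt.step()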
paper: https://arxiv.org/abs/1912.01603
github: https://github.com/google-research/dreamer
site: https://danijar.com/dreamer
#RL #Dreams #Imagination #DL #GoogleBrain #DeepMind
Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq
by huggingface
This post briefly goes through the (modern) history of #transformers and the comeback of the encoder-decoder architecture.
The author walks through the implementation of encoder-decoders in the transformers library, shows how you can use them in your projects, and gives a taste of what is coming in the next releases.
Blog: https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8
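For reference, this is roughly what the interface looks like in today's transformers (class names have changed since the post, so treat this as a sketch; the warm-started model still needs fine-tuning before its outputs make sense):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")        # warm-start both sides from BERT
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("A long article to summarise ...", return_tensors="pt")
out = model.generate(inputs.input_ids, max_length=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))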
Very cool application of GPT-2
It is a text-adventure game endlessly generated by the large GPT-2 model, reacting to your actions, which can be literally anything you can put into words - without any game master or game designer limitations. GPT-2 was fine-tuned on a collection of adventure texts.
It does not always work well, especially with a custom setting (I tried to set up a cyberpunk world, but it still drifted into fantasy at times).
Still, it is a fun and very cool application of this kind of network, and it is genuinely surprising every time to see the power of this model, especially on this task.
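Under the hood this is essentially standard sampling from a fine-tuned GPT-2. A sketch with vanilla GPT-2 via transformers (AI Dungeon itself uses the large model fine-tuned on adventure texts, plus its own prompt bookkeeping):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "You are a netrunner in a neon-lit city. You"
ids = tokenizer.encode(prompt, return_tensors="pt")
out = model.generate(ids, do_sample=True, top_p=0.9, max_length=80,
                     pad_token_id=tokenizer.eos_token_id)   # nucleus sampling continuation
print(tokenizer.decode(out[0]))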
Site: http://www.aidungeon.io/
Post: https://pcc.cs.byu.edu/2019/11/21/ai-dungeon-2-creating-infinitely-generated-text-adventures-with-deep-learning-language-models/
Github: https://github.com/nickwalton/AIDungeon/
Play in colab: https://colab.research.google.com/drive/1u7flclharvMchwWHY7Ya41NKjX3dkslu#forceEdit=true&sandboxMode=true&scrollTo=FKqlSCrpS9dH
#GPT2 #NLP #NLU
Guide to reading articles in 202x:
1. Accept cookies
2. Block notifications
3. Deny location to website
4. Decline invitation to subscribe
5. Stop auto-playing video ads/mute sound
6. Dismiss reminder of free articles remaining
7. Shrink drop down banner
8. Click "read more"
9. Give up
Improving Transformer Models by Reordering their Sublayers
tl;dr: improve transformers by reordering their sublayers into a "sandwich transformer"
The authors train random #transformer models with reordered sublayers and find that some perform better than the baseline interleaved transformer in #language #modeling.
They observe that, on average, better models contain more self-attention #sublayers at the bottom and more feedforward sublayers at the top.
This leads them to design a new transformer stack, the sandwich transformer, which consistently improves performance over the baseline at no cost.
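In (pseudo-)code, the reordering is just a different sublayer string. The helper below is mine; as I understand it, the paper's language models use 16 attention/feed-forward pairs and call k the sandwich coefficient.

def sandwich(n_pairs=16, k=6):
    # s = self-attention sublayer, f = feed-forward sublayer.
    # k = 0 gives the usual interleaved stack (sf)^n; k > 0 moves k self-attention
    # sublayers to the bottom and k feed-forward sublayers to the top,
    # keeping the total number of sublayers unchanged.
    return "s" * k + "sf" * (n_pairs - k) + "f" * k

print(sandwich(k=0))   # sfsf...sf (baseline)
print(sandwich(k=6))   # ssssss...ffffff (a sandwich)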
paper: https://ofir.io/sandwich_transformer.pdf
Episodic Memory in Lifelong Language Learning
tl;dr: the model must learn from a stream of text examples without any dataset identifier.
The authors propose an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in this setup. Experiments on text classification and question answering demonstrate the complementary benefits of sparse experience replay & local adaptation to allow the model to continuously learn from new datasets.
Also, they show that the space complexity of the episodic memory module can be reduced significantly (~50-90%) by randomly choosing which examples to store in memory, with a minimal decrease in performance. They consider an episodic memory component a crucial building block of general linguistic intelligence and see this model as a first step in that direction.
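A toy sketch of the memory module described above: write each example with some probability, replay sparsely. The local-adaptation step at inference is omitted and all names are mine, not the paper's.

import random

class EpisodicMemory:
    def __init__(self, write_prob=0.5):          # write_prob < 1 gives the 50-90% memory saving
        self.buffer = []
        self.write_prob = write_prob

    def write(self, example):                    # called for every incoming training example
        if random.random() < self.write_prob:
            self.buffer.append(example)

    def sample(self, k):                         # sparse replay: a small batch drawn only every N steps
        return random.sample(self.buffer, min(k, len(self.buffer)))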
paper: https://arxiv.org/abs/1906.01076
#nlp #bert #NeurIPSConf19