DL in NLP – Telegram

News and paper reviews on natural language processing, neural networks, and all that.

Contact: @dropout05 (no ads)

An excellent presentation from the latest EMNLP

A SOTA-less, novelty-less journey into neural sequence models

TL;DR
When neural machine translation first appeared, it was not SOTA, and its improvements were not new ideas at the time. Yet today it is the de facto standard in both research and production.
Right now there are many examples of non-autoregressive text generation, but they are not SOTA and their improvements are not new ideas.

I highly recommend reading the presentation; there is far more of interest in it than fits in a TL;DR.

https://drive.google.com/file/d/1HGzv6n9hAj-GL63POUZCO6nCrIHF9y35/view

1.23K views Vlad Lialin, 18:00

Evaluating Combinatorial Generalization in Variational Autoencoders
Bozkurt, Esmaeili, et al. Northeastern University
arxiv.org/abs/1911.04594

The paper studies how well shallow and deep VAEs generalize under different dataset splits. The authors use two splitting schemes, an “easy” and a “hard” generalization problem, and vary the dataset size (“small dataset” vs. “big dataset”).

VAEs are trained to autoencode MNIST images.

First, they study how well a VAE memorizes the training set. Deep models memorize it more than shallow ones and reuse memorized examples when reconstructing unseen data. In particular, they find that the reconstructions of unseen data (e.g., some class in MNIST that was absent during training) are closer to training examples in a deep model.
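One way to make "closer to training examples" concrete is to measure, for each reconstruction, the distance to its nearest training image. The sketch below is only an illustration under assumptions: the post does not say which distance the authors use, so Euclidean distance is assumed, and the tiny 4-pixel "images" and all function names are made up for the example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two flattened images of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_training_distance(reconstruction, training_set):
    """Distance from one reconstruction to its closest training example."""
    return min(l2_distance(reconstruction, example) for example in training_set)

def mean_nearest_distance(reconstructions, training_set):
    """Average nearest-neighbour distance over a batch of reconstructions."""
    distances = [nearest_training_distance(r, training_set) for r in reconstructions]
    return sum(distances) / len(distances)

# Hypothetical toy data (not from the paper): two training "images"
# and two candidate reconstructions of an unseen input.
train = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
recon_close = [[0.0, 0.1, 1.0, 0.9]]   # nearly a copy of a training example
recon_far = [[0.5, 0.5, 0.5, 0.5]]     # equally far from both training examples

print(mean_nearest_distance(recon_close, train))
print(mean_nearest_distance(recon_far, train))
```

A smaller value means the model's output sits closer to something it has memorized, which is the kind of behaviour the post attributes to deep models.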

Their study is consistent with the work of Belkin et al. 2018 in the case of “easy” generalization - deep models generalize better with increased capacity. But in the case of “hard” generalization, deeper models perform worse as the capacity increases.

They also found that increasing the amount of data helps deep models generalize much more than it helps shallow ones.

I think this is the first paper in a long time with both MNIST and exciting findings.

1.31K views nlpcontroller_bot, 18:47

Evaluating_highlighted.pdf

2.36K views nlpcontroller_bot, 18:47

When NLP meets MMM.
Because we are lucky with abbreviations.

arxiv.org/abs/1910.00458

MMM: Multi-stage Multi-task Learning for Multi-choice Reading Comprehension

Machine Reading Comprehension (MRC) for question answering (QA), which aims
to answer a question given the relevant context passages, is an important way
to test the ability of intelligence...

1.85K views nlpcontroller_bot, 07:40

Energy-Based Self-Supervised Learning
Yann LeCun

Since the stream of interesting NLP papers has gone a bit quiet, here is another presentation. Slides from a very cool lecture by LeCun on pretraining and latent-variable models. I attended the same lecture of his at Harvard, and it was interesting.

https://drive.google.com/file/d/1NCLbdkIDaU1ZvZ3dp7xi7CGhxKRgWChw/view

1.88K views nlpcontroller_bot, 08:26

via twitter.com/EricTopol

1.24K views nlpcontroller_bot, 08:26

The lottery ticket hypothesis suggests that by training DNNs from “lucky” initializations, we can train networks which are 10-100x smaller with minimal performance losses. In new work, we extend our understanding of this phenomenon in several ways... https://ai.facebook.com/blog/understanding-the-generalization-of-lottery-tickets-in-neural-networks https://twitter.com/facebookai/status/1199042155743862784/video/1

Do lottery tickets contain generic inductive biases or are they overfit to the particular dataset and optimizer used to find them? Encouragingly, we found that lottery tickets generalize across related, but distinct datasets and across optimizers: https://arxiv.org/abs/1906.02773

Is the lottery ticket phenomenon a general property of DNNs or merely an artifact of supervised image classification? We show that the lottery ticket phenomenon is a general property which is present in both #reinforcementlearning and #NLP

Can we begin to explain lottery tickets theoretically? We introduce a new theoretical framework on the formation of lottery tickets to help researchers advance toward a better understanding of lucky initializations

Via twitter.com/facebookai/status/1199042159334154241
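For readers who have not seen how a "lottery ticket" is usually extracted, here is a minimal sketch of the standard prune-and-reset recipe from the original lottery ticket hypothesis work: train, keep the largest-magnitude weights, and reset the survivors to their initial values. It is not taken from the Facebook AI post itself; the flat weight lists, the numbers, and the function names are hypothetical.

```python
def magnitude_mask(trained, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights."""
    n_keep = max(1, round(len(trained) * keep_fraction))
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(trained))]

def winning_ticket(initial, trained, keep_fraction):
    """Prune by trained magnitude, then reset surviving weights to their initial values."""
    mask = magnitude_mask(trained, keep_fraction)
    ticket = [w0 * m for w0, m in zip(initial, mask)]
    return ticket, mask

# Hypothetical weights of a tiny "network" before and after training.
initial = [0.5, -0.2, 0.1, 0.9, -0.7, 0.3, 0.05, -0.4, 0.6, -0.1]
trained = [0.1, -1.5, 0.02, 0.3, -0.05, 2.0, 0.01, -0.2, 0.04, -0.03]

# keep_fraction=0.2 corresponds to a network 5x smaller.
ticket, mask = winning_ticket(initial, trained, keep_fraction=0.2)
print(mask)
print(ticket)
```

The sparse network defined by `ticket` and `mask` is then retrained from those initial values; the "10-100x smaller" claim in the thread corresponds to keep fractions between 0.1 and 0.01.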

1.4K views nlpcontroller_bot, 11:02

And a bit of news from a parallel (but very close) world to NLP

We just released the paper and code for Mellotron: a multispeaker voice synthesis model that can make a voice emote and sing without emotive or singing training data.
https://github.com/NVIDIA/mellotron

Via twitter.com/RafaelValleArt/status/1199017762774900738

1.76K views nlpcontroller_bot, 11:02

Forwarded from Anya

https://twitter.com/Smerity/status/1199529360954257408

Introducing the SHA-RNN :)
- Read alternative history as a research genre
- Learn of the terrifying tokenization attack that leaves language models perplexed
- Get near SotA results on enwik8 in hours on a lone GPU
No Sesame Street or Transformers allowed.…

1.51K views Vlad Lialin, 18:10

Forwarded from Petr Ostroukhov

A new guide from Jay Alammar on using BERT https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

A Visual Guide to Using BERT for the First Time

1.57K views Cookie Thief, 13:29

Not Enough Data? Deep Learning to the Rescue!
Anaby-Tavor et al
https://arxiv.org/abs/1911.03118

A method for augmenting a text dataset by generating examples with a language model. The experiments are not very convincing, but the method seems to give a good boost when a class has fewer than 10 examples.
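To illustrate the general idea of generating extra training examples from a language model, here is a toy sketch. It is not the authors' method: a per-class bigram model stands in for whatever language model the paper uses, and the intent labels, sentences, and function names are all hypothetical.

```python
import random
from collections import defaultdict

def train_bigram_lm(sentences):
    """Collect bigram successors; '<s>' and '</s>' mark sentence boundaries."""
    successors = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            successors[prev].append(nxt)
    return successors

def sample_sentence(lm, rng, max_len=12):
    """Sample one sentence by walking the bigram successors from '<s>'."""
    token, words = "<s>", []
    while len(words) < max_len:
        token = rng.choice(lm[token])
        if token == "</s>":
            break
        words.append(token)
    return " ".join(words)

def augment(examples_by_label, n_new, seed=0):
    """Generate up to n_new unseen sentences per class from a class-specific LM."""
    rng = random.Random(seed)
    augmented = {}
    for label, sentences in examples_by_label.items():
        lm = train_bigram_lm(sentences)
        seen, new = set(sentences), set()
        for _ in range(n_new * 20):  # bounded number of sampling attempts
            candidate = sample_sentence(lm, rng)
            if candidate and candidate not in seen:
                new.add(candidate)
            if len(new) >= n_new:
                break
        augmented[label] = sorted(new)
    return augmented

# Hypothetical low-resource dataset: three examples per class.
data = {
    "book": ["book a flight to paris", "book a ticket to london", "find a flight to berlin"],
    "cancel": ["cancel my flight to paris", "cancel my booking for tomorrow", "please cancel my ticket"],
}
generated = augment(data, n_new=3)
for label, sentences in generated.items():
    print(label, sentences)
```

In a realistic setup the generated sentences would usually be filtered (for example, by a classifier trained on the original data) before being added to the training set, since a generator can easily drift off-class.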

1.53K views nlpcontroller_bot, edited 19:01

How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?
Chen et al., 2019 [UCLA]
arxiv.org/abs/1911.12360

The theory of deep learning is a new and fast-developing field. Recent studies suggest that huge over-parameterization of neural networks is not a bug but a feature that allows deep NNs both to generalize and to be optimized with simple (first-order gradient) methods.
Chen et al. take another step toward solving the mysteries of deep learning. Their main results are:
1. Sharp optimization and generalization guarantees for deep ReLU networks
2. Better asymptotics that allow the theory to be applied to smaller networks (polylogarithmic instead of polynomial hidden size)
As the authors say, "Our results push the study of over-parameterized deep neural networks towards more practical settings."

For a deep dive into the theory of deep learning, I suggest
iPavlov: github.com/deepmipt/tdl (Russian and English)
Stanford: stats385.github.io (English)

1.36K views nlpcontroller_bot, edited 20:59

Do Attention Heads in BERT Track Syntactic Dependencies?
Mon Htut et al. [NYU]
arxiv.org/abs/1911.12246

“Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.

We also analyze BERT fine-tuned on two datasets — the syntax-oriented CoLA and the semantics-oriented MNLI — but we do not observe substantial differences in the overall dependency relations extracted using our methods.”
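To make "analyzing attention weights directly" more tangible, here is a small sketch of one simple way to read dependencies off a single attention head: treat the most-attended other token as the predicted syntactic head and score it against gold heads. This is an assumption for illustration, not necessarily the extraction method used in the paper; the sentence, the attention matrix, and the right-neighbour baseline are all made up.

```python
def predicted_heads(attention):
    """attention[i][j] is the weight token i puts on token j.
    For each token, predict the most-attended *other* token as its head."""
    heads = []
    for i, row in enumerate(attention):
        candidates = [j for j in range(len(row)) if j != i]
        heads.append(max(candidates, key=lambda j: row[j]))
    return heads

def unlabeled_attachment_score(predicted, gold):
    """Fraction of non-root tokens whose predicted head matches the gold head.
    gold[i] is a head index, or None for the root (which is skipped)."""
    scored = [(p, g) for p, g in zip(predicted, gold) if g is not None]
    return sum(p == g for p, g in scored) / len(scored)

def right_neighbor_baseline(n_tokens):
    """Trivial positional baseline: attach each token to the next one
    (the last token attaches to the previous one)."""
    return [i + 1 if i + 1 < n_tokens else i - 1 for i in range(n_tokens)]

# Hypothetical example: "the cat sat down", with "sat" as the root.
tokens = ["the", "cat", "sat", "down"]
gold = [1, 2, None, 2]
attention = [
    [0.1, 0.7, 0.1, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.1, 0.1, 0.6, 0.2],
]

head_uas = unlabeled_attachment_score(predicted_heads(attention), gold)
baseline_uas = unlabeled_attachment_score(right_neighbor_baseline(len(tokens)), gold)
print(head_uas, baseline_uas)
```

Comparing every head against a positional baseline like this is what makes a claim such as "no generalist head beats a trivial baseline" checkable.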

1.38K views nlpcontroller_bot, 21:07

Forwarded from Говорит AI (Nicolas Ivanov)

The Dialogue Dodecathlon
https://parl.ai/projects/dodecadialogue/

TL;DR
They combined 12 dialogue datasets, trained a transformer-based seq2seq model on them in multitask mode, and achieved SOTA on all 12 tasks.

The core of the approach
Two ideas for training a generative dialogue model that works in an open-domain setting:
1. For pretraining, it is better to use dialogue data (Reddit) rather than arbitrary text (for example WebText, on which GPT2 was trained).
2. It is better to train models in multitask mode:
- first, it is convenient to have one universal model instead of 10-20 specialized ones;
- second, in theory, training on some tasks can help achieve good results on others; this is why the datasets considered in the paper include not only text datasets but also QA datasets over images.

Datasets considered in the paper:
- ConvAI - conditioning on persona facts
- DailyDialog - discussion of various everyday topics
- Wiz. of Wikipedia - conditioning on facts from Wikipedia
- Empathetic Dialog - discussion of life situations in a friendly (therapeutic) manner
- Cornell Movie - subtitles
- LIGHT - role-play in fictional situations
- ELI5 - long-form questions and answers
- Ubuntu - support chat
- Twitter - twitter
- pushshift.io Reddit - 2.2 billion sentences from Reddit on various topics
- Image Chat - discussion of people in images
- IGC - questions and answers about images on various topics

Results
As the baseline, they took a pretrained GPT2 model in the Hugging Face implementation (which the paper for some reason calls BERT).
As the competitor, they used a transformer-based seq2seq model from their own ParlAI, to which they added, among other things, the ability to condition on features extracted from images.

Conclusion 1.
The best pretraining strategy for dialogue models is to train on the huge pushshift.io Reddit dataset (2.2 billion sentences). Pretraining on Twitter and using GPT2 weights both lose substantially in perplexity.
For reference, they trained their seq2seq on Reddit for two weeks on 64 Nvidia V100s.

Conclusion 2.
If, after pretraining on Reddit, the model is further trained on all 12 tasks in multitask mode, the result is already a universal model that beats almost all previous task-specific models on perplexity and on task-specific metrics such as BLEU / ROUGE / F1.

Conclusion 3.
The most effective approach is still fine-tuning on a specific task: first pretraining on Reddit, then training on all tasks in multitask mode, then fine-tuning on the specific task. This yields new SOTA models for all 12 tasks.

Conclusion 4.
Perhaps the most interesting result of the paper concerns the so-called Leave-One-Out Zero-Shot Performance: fine-tune in multitask mode on all datasets except one, and test on the remaining one.
The authors show that in this case the metrics on the new dataset are also quite decent (unless Reddit is dropped from fine-tuning), which suggests that multitask training improves the model's generalization and "knowledge transfer" to new domains.
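The three-stage recipe and the leave-one-out protocol described above can be written down as plain data, which makes the experimental setup easier to see. This is only a bookkeeping sketch: the dataset names come from the list in this post, while the function names and the stage labels are hypothetical and no actual ParlAI API is used.

```python
DATASETS = [
    "ConvAI", "DailyDialog", "Wiz. of Wikipedia", "Empathetic Dialog",
    "Cornell Movie", "LIGHT", "ELI5", "Ubuntu", "Twitter",
    "pushshift.io Reddit", "Image Chat", "IGC",
]

def training_plan(target, datasets, pretrain_corpus="pushshift.io Reddit"):
    """Pretrain on Reddit, multitask on every dataset, then fine-tune on one task."""
    return [
        ("pretrain", [pretrain_corpus]),
        ("multitask", list(datasets)),
        ("finetune", [target]),
    ]

def leave_one_out_splits(datasets):
    """Yield (multitask training datasets, held-out test dataset) pairs."""
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        yield train, held_out

splits = list(leave_one_out_splits(DATASETS))
for train, held_out in splits[:3]:
    print(f"train on {len(train)} datasets, zero-shot test on {held_out}")

print(training_plan("LIGHT", DATASETS))
```

Each leave-one-out split trains on 11 datasets and evaluates zero-shot on the remaining one, so the full protocol amounts to 12 separate multitask runs.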

80 views Vlad Lialin, 16:10

The hipsters at fast.ai have reinvented Jupyter notebooks. Need to give it a try.

fast.ai/2019/12/02/nbdev

Twitter thread with examples:
twitter.com/jeremyphoward/status/1201447678346842112

1.74K views Vlad Lialin, edited 16:19