class: center, middle # Natural Language Processing with Deep Learning Charles Ollion - Olivier Grisel .affiliations[ ![IPP](images/logo_ipp.jpeg) ![Inria](images/inria-logo.png) ![UPS](images/Logo_Master_Datascience.png) ] --- ## Natural Language Processing .center[
] --- ## Natural Language Processing - Sentence/Document level Classification (topic, sentiment) - Topic modeling (LDA, ...) -- - Translation -- - Chatbots / dialogue systems / assistants (Alexa, ...) -- - Summarization --- # Recommended reading *A Primer on Neural Network Models for Natural Language Processing* by Yoav Goldberg .center[ http://u.cs.biu.ac.il/~yogo/nnlp.pdf ] -- Useful open source projects .center[
] --- # Outline
### Classification and word representation -- ### Word2Vec -- ### Language Modelling -- ### Recurrent neural networks --- class: middle, center # Word Representation and Word2Vec --- # Word representation Words are indexed and represented as 1-hot vectors -- Large Vocabulary of possible words $|V|$ -- Use of **Embeddings** as inputs in all Deep NLP tasks -- Word embeddings usually have dimensions 50, 100, 200, 300 --- # Supervised Text Classification .center[
] .footnote.small[ Joulin, Armand, et al. "Bag of tricks for efficient text classification." FAIR 2016 ] ??? Question: shape of embeddings if hidden size is H -- $\mathbf{E}$ embedding (linear projection) .right.red[`|V| x H`] -- Embeddings are averaged .right[hidden activation size: .red[`H`]] -- Dense output connection $\mathbf{W}, \mathbf{b}$ .right.red[`H x K`] -- Softmax and **cross-entropy** loss --- # Supervised Text Classification .center[
] .footnote.small[ Joulin, Armand, et al. "Bag of tricks for efficient text classification." FAIR 2016 ]
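A minimal Keras-style sketch of this fastText-like architecture (the sizes are placeholders, and `padded_token_ids` / `labels` are assumed to be integer-encoded inputs, not taken from the slides):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes (not from the slides): vocabulary, embedding dim, classes
vocab_size, embed_dim, n_classes = 50000, 100, 5

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),        # E: |V| x H lookup table
    layers.GlobalAveragePooling1D(),                # average the word embeddings -> H
    layers.Dense(n_classes, activation="softmax"),  # W, b: H x K output projection
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_token_ids, labels, ...) on integer-encoded, padded sentences
```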
- Very efficient (**speed** and **accuracy**) on large datasets -- - State-of-the-art (or close to it) on several classification tasks, when adding **bigrams/trigrams** -- - Little gains from depth --- # Transfer Learning for Text As with images: can we have word representations that are generic enough to **transfer** from one task to another? -- **Unsupervised / self-supervised** learning of word representations -- **Unlabelled** text data is almost infinite: - Wikipedia dumps - Project Gutenberg - Social Networks - Common Crawl --- # Word Vectors .center[
] .footnote.small[ excerpt from work by J. Turian on a model trained by R. Collobert et al. 2008 ] ??? Question: what distance to use in such a space --- # Word2Vec .center[
] .footnote.small[ Collobert et al. 2011, Mikolov, et al. 2013 ] --
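Regarding the question of which distance to use in such a space: cosine similarity is the standard choice. A numpy sketch, assuming a hypothetical embedding matrix `E` (`|V| x d`) and vocabulary mappings `word2idx` / `idx2word`:

```python
import numpy as np

def most_similar(word, E, word2idx, idx2word, k=5):
    """Return the k nearest words to `word` by cosine similarity."""
    v = E[word2idx[word]]
    sims = E @ v / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-8)
    best = np.argsort(-sims)[1:k + 1]   # drop the query word itself (rank 0)
    return [(idx2word[i], float(sims[i])) for i in best]
```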
### Compositionality .center[
] --- # Word Analogies .center[
] .footnote.small[ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013 ] -- - Linear relations in Word2Vec embeddings -- - Many come from text structure (e.g. Wikipedia) --- # Self-supervised training Distributional Hypothesis (Harris, 1954): *“words are characterised by the company that they keep”* Main idea: learning word embeddings by **predicting word contexts** .footnote.small[ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013 ] -- Given a word, e.g. “carrot”, and any other word $w \in V$, predict the probability $P(w|\text{carrot})$ that $w$ occurs in the context of “carrot”. -- - **Unsupervised / self-supervised**: no need for class labels. - (Self-)supervision comes from **context**. - Requires a lot of text data to cover rare words correctly. ??? How to train a fastText-like model on this? --- # Word2Vec: CBoW CBoW: representing the context as **Continuous Bag-of-Words** Self-supervision from a large unlabeled corpus of text: *slide* over an **anchor word** and its **context**: .center[
] .footnote.small[ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013 ] --- # Word2Vec: CBoW CBoW: representing the context as **Continuous Bag-of-Words** Self-supervision from a large unlabeled corpus of text: *slide* over an **anchor word** and its **context**: .center[
] .footnote.small[ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013 ] ??? Question: dim of output embedding vs dim of input embedding --- # Word2Vec: Details
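To make the self-supervision concrete, here is a sketch of how (context, anchor) training pairs can be generated by sliding a window over a tokenized corpus (window size and tokens are illustrative):

```python
def cbow_pairs(tokens, window=2):
    """Yield (context, anchor) pairs by sliding a window over a token sequence."""
    for i, anchor in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, anchor

# e.g. list(cbow_pairs("the cat sat on the mat".split()))
# -> [(['cat', 'sat'], 'the'), (['the', 'sat', 'on'], 'cat'), ...]
```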
.center[
] .footnote.small[ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013 ] - Similar to supervised CBoW (e.g. fastText) with |V| classes -- - Use **Negative Sampling**: sample *negative* words at random instead of computing the full softmax. See:
http://sebastianruder.com/word-embeddings-softmax/index.html
-- - Large impact of **context size** ??? The full softmax is too computationally intensive to be practical: the normalization term involves a sum over the full vocabulary of cardinality |V| >> 10000 at each gradient step. Negative Sampling uses k=5 negative words sampled at random instead. This is not accurate enough to estimate `p(x_t|x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2})` but it is a good enough approximation to train useful word embedding parameters. --- # Word2Vec: Skip Gram
.center[
]
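A numpy sketch of the skip-gram loss with negative sampling for a single (center, context) pair; `W_in` and `W_out` denote the two `|V| x d` embedding matrices, and the negative word ids are assumed to be sampled elsewhere (e.g. from a unigram distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_ns_loss(center_id, context_id, negative_ids, W_in, W_out):
    """Negative-sampling loss for one (center word, context word) pair."""
    v = W_in[center_id]                                       # d-dim center embedding
    pos = np.log(sigmoid(W_out[context_id] @ v))              # push true context word up
    neg = np.sum(np.log(sigmoid(-W_out[negative_ids] @ v)))   # push negatives down
    return -(pos + neg)                                       # minimized w.r.t. W_in, W_out
```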
- Given the central word, predict the occurrence of other words in its context. -- - Widely used in practice -- - Again, **Negative Sampling** is used as a cheaper alternative to the full softmax. --- # Evaluation and Related methods It is always difficult to evaluate unsupervised tasks - WordSim (Finkelstein et al.) - SimLex-999 (Hill et al.) - Word Analogies (Mikolov et al.), see the sketch below --
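A small numpy sketch of the analogy evaluation mentioned above, reusing the same hypothetical embedding matrix `E` and vocabulary mappings as before:

```python
import numpy as np

def analogy(a, b, c, E, word2idx, idx2word):
    """Return the word whose vector is closest to E[b] - E[a] + E[c]."""
    q = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
    for i in np.argsort(-sims):
        if idx2word[i] not in (a, b, c):     # exclude the query words
            return idx2word[i]

# analogy("man", "king", "woman", E, word2idx, idx2word) should return "queen"
# if the linear structure of the embedding space holds for this triple.
```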
Other popular method: **GloVe** (Pennington et al.) http://nlp.stanford.edu/projects/glove/ .footnote.small[ Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. 2014 ] --- # Take Away on Embeddings **For text applications, inputs of Neural Networks are Embeddings** -- - If there is **little training data** and a wide vocabulary that is not well covered by the training data, use **pre-trained self-supervised embeddings** (transfer learning from GloVe, word2vec or fastText embeddings) -- - If there is **a large amount of labeled training data**, directly learn a task-specific embedding with methods such as **fastText in supervised mode**. -- - These methods use **Bag-of-Words** (BoW): they **ignore the order** in word sequences -- - Depth & non-linear activations on hidden layers are not that useful for BoW text classification. -- **Word Embeddings** are no longer state of the art for NLP tasks: BERT-style pretraining of deep transformers with sub-word tokenization is now used everywhere. --- class: middle, center # Language Modelling and Recurrent Neural Networks --- ## Language Models Assign a probability to a sequence of words, such that plausible sequences have higher probabilities e.g.: - $p(\text{"I like cats"}) > p(\text{"I table cats"})$ - $p(\text{"I like cats"}) > p(\text{"like I cats"})$ -- **Auto-regressive sequence modelling** $p\_{\theta}(w\_{0})$ $p_{\theta}$ is parametrized by a neural network. --- ## Language Models Assign a probability to a sequence of words, such that plausible sequences have higher probabilities e.g.: - $p(\text{"I like cats"}) > p(\text{"I table cats"})$ - $p(\text{"I like cats"}) > p(\text{"like I cats"})$ **Auto-regressive sequence modelling** $p\_{\theta}(w\_{0}) \cdot p\_{\theta}(w\_{1} | w\_{0})$ $p_{\theta}$ is parametrized by a neural network. --- ## Language Models Assign a probability to a sequence of words, such that plausible sequences have higher probabilities e.g.: - $p(\text{"I like cats"}) > p(\text{"I table cats"})$ - $p(\text{"I like cats"}) > p(\text{"like I cats"})$ **Auto-regressive sequence modelling** $p\_{\theta}(w\_{0}) \cdot p\_{\theta}(w\_{1} | w\_{0}) \cdot \ldots \cdot p\_{\theta}(w\_n | w\_{n-1}, w\_{n-2}, \ldots, w\_0)$ $p_{\theta}$ is parametrized by a neural network. -- The internal representation of the model can better capture the meaning of a sequence than a simple Bag-of-Words. --- ## Conditional Language Models NLP problems expressed as **Conditional Language Models**: **Translation:** $p(Target | Source)$ - *Source*: "J'aime les chats" - *Target*: "I like cats" -- Model the output word by word: $p\_{\theta}(w\_{0} | Source)$ --- ## Conditional Language Models NLP problems expressed as **Conditional Language Models**: **Translation:** $p(Target | Source)$ - *Source*: "J'aime les chats" - *Target*: "I like cats" Model the output word by word: $p\_{\theta}(w\_{0} | Source) \cdot p\_{\theta}(w\_{1} | w\_{0}, Source) \cdot \ldots $ --- ## Conditional Language Models **Question Answering / Dialogue:** $p( Answer | Question , Context )$ - *Context*: - "John puts two glasses on the table." - "Bob adds two more glasses." - "Bob leaves the kitchen to play baseball in the garden." - *Question*: "How many glasses are there?" - *Answer*: "There are four glasses." -- **Image Captioning:** $p( Caption | Image )$ - Image is usually the $2048$-d representation from a CNN ??? Question: can you think of other NLP tasks that could be tackled with a similar conditional modeling approach? 
--- ## Simple Language Model .center[
] -- Fixed context size - **Average embeddings**: (same as CBoW) no sequence information -- - **Concatenate embeddings**: introduces many parameters -- - **1D convolution**: larger contexts and limits the number of parameters -- - Still does not handle varying sequence lengths and dependencies within the sequence well ??? Question: What's the dimension of the concatenated embeddings? --- ## Recurrent Neural Network .center[
] -- Unroll over a sequence $(x_0, x_1, x_2)$: .center[
] --- ## Recurrent Neural Network .center[
] Unroll over a sequence $(x_0, x_1, x_2)$: .center[
] --- ## Recurrent Neural Network .center[
] Unroll over a sequence $(x_0, x_1, x_2)$: .center[
] --- ## Language Modelling .center[
] **input** $(w\_0, w\_1, ..., w\_t)$ .small[ sequence of words ( 1-hot encoded ) ]
**output** $(w\_1, w\_2, ..., w\_{t+1})$ .small[shifted sequence of words ( 1-hot encoded ) ] --- ## Language Modelling .center[
] $x\_t = \text{Emb}(w\_t) = \mathbf{E} w\_t$ .right[input projection .red[`H`]] -- $h\_t = g(\mathbf{W^h} h\_{t-1} + x\_t + b^h)$ .right[recurrent connection .red[`H`]] -- $y = \text{softmax}( \mathbf{W^o} h\_t + b^o )$ .right[output projection .red[`K = |V|`]] --- ## Recurrent Neural Network .center[
] Input embedding $\mathbf{E}$ .right[.red[`|V| x H`]] -- Recurrent weights $\mathbf{W^h}$ .right[.red[`H x H`]] -- Output weights $\mathbf{W^{out}}$ .right[ .red[`H x K = H x |V|`]] --- ## Backpropagation through time Similar to standard backpropagation on the unrolled network .center[
] --- ## Backpropagation through time Similar to standard backpropagation on the unrolled network .center[
] --- ## Backpropagation through time Similar to standard backpropagation on the unrolled network .center[
]
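For reference, a numpy sketch of the unrolled forward pass defined by the Language Modelling equations above, using the same `|V| x H`, `H x H`, `H x |V|` shapes; in practice the backward pass through these steps is what gets truncated after `T` timesteps:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(word_ids, E, Wh, bh, Wo, bo):
    """Unrolled forward pass: E is |V| x H, Wh is H x H, Wo is H x |V|."""
    h = np.zeros(Wh.shape[0])
    outputs = []
    for w in word_ids:
        x = E[w]                              # x_t = E w_t (embedding lookup)
        h = np.tanh(Wh @ h + x + bh)          # h_t = g(W^h h_{t-1} + x_t + b^h)
        outputs.append(softmax(h @ Wo + bo))  # y_t = softmax(W^o h_t + b^o)
    return outputs                            # one distribution over |V| per timestep
```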
-- - Similar to training **very deep networks** with tied parameters - Example between $x_0$ and $y_2$: $W^h$ is used twice -- - Usually truncate the backprop after $T$ timesteps -- - Long-term dependencies are difficult to learn --- ## Other uses: Sentiment Analysis .center[
] - Output is sentiment (1 for positive, 0 for negative) -- - Very dependent on word order -- - Very flexible network architectures --- ## Other uses: Sentiment analysis .center[
] - Output is sentiment (1 for positive, 0 for negative) - Very dependent on word order - Very flexible network architectures --- # LSTM .center[
] .footnote.small[ Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 1997 ] --- # LSTM .center[
] .footnote.small[ Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 1997 ] --- # LSTM .center[
] .footnote.small[ Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 1997 ] -- - 4 times more parameters than RNN -- - Mitigates **vanishing gradient** problem through **gating** -- - Widely used and SOTA in many sequence learning problems --- .footnote.small[ Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 1997 ] $\mathbf{ u} = \sigma(\mathbf{W^u} \cdot h\_{t-1} + \mathbf{I^u} \cdot x\_t + b^u)$ .right[Update gate .red[`H`]] -- $\mathbf{ f} = \sigma(\mathbf{W^f} \cdot h\_{t-1} + \mathbf{I^f} \cdot x\_t + b^f)$ .right[Forget gate .red[`H`]] -- $\mathbf{ \tilde{c\_t}} = \tanh(\mathbf{W^c} \cdot h\_{t-1} + \mathbf{I^c} \cdot x\_t + b^c)$ .right[Cell candidate .red[`H`]] -- $\mathbf{ c\_t} = \mathbf{f} \odot \mathbf{c\_{t-1}} + \mathbf{u} \odot \mathbf{ \tilde{c\_t}}$ .right[Cell output .red[`H`]] -- $\mathbf{ o} = \sigma(\mathbf{W^o} \cdot h\_{t-1} + \mathbf{I^o} \cdot x\_t + b^o)$ .right[Output gate .red[`H`]] -- $\mathbf{ h\_t} = \mathbf{o} \odot \tanh(\mathbf{c\_t})$ .right[Hidden output .red[`H`]] -- $y = \text{softmax}( \mathbf{W} \cdot h\_t + b )$ .right[Output .red[`K`]] --
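The gate equations above translate almost line for line into code. A numpy sketch of a single step, with the recurrent weights applied as `h @ W` (`H x H`) and the input weights as `x @ I` (`N x H`), matching the parameter shapes listed just after:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p holds the weights W*, I* and biases b* for each gate."""
    u = sigmoid(h_prev @ p["Wu"] + x @ p["Iu"] + p["bu"])        # update gate
    f = sigmoid(h_prev @ p["Wf"] + x @ p["If"] + p["bf"])        # forget gate
    c_tilde = np.tanh(h_prev @ p["Wc"] + x @ p["Ic"] + p["bc"])  # cell candidate
    c = f * c_prev + u * c_tilde                                 # new cell state
    o = sigmoid(h_prev @ p["Wo"] + x @ p["Io"] + p["bo"])        # output gate
    h = o * np.tanh(c)                                           # hidden output
    return h, c
```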
$W^u, W^f, W^c, W^o$ .right[Recurrent weights .red[`H x H`]] $I^u, I^f, I^c, I^o$ .right[Input weights .red[`N x H`]] --- # GRU Gated Recurrent Unit: similar idea to the LSTM .footnote.small[ Chung, Junyoung, et al. "Gated Feedback Recurrent Neural Networks." ICML 2015 ] - fewer parameters, as there is one gate fewer - no "cell", only the hidden vector $h_t$ is passed to the next unit -- In practice - GRU is more recent; people tend to use LSTM more - no systematic difference between the two --- ## Vanishing / Exploding Gradients Passing through $t$ time-steps, the resulting gradient is the **product** of many gradients and activations. -- - Gradient messages close to $0$ can shrink to $0$ - Gradient messages larger than $1$ can explode -- - **LSTM / GRU** mitigate that in RNNs - **Additive path** between $c\_t$ and $c\_{t-1}$ -- - **Gradient clipping** prevents gradient explosion - A well-chosen **activation function** is critical (tanh) -- **Skip connections** in ResNet also alleviate a similar optimization problem. --- class: middle, center # Lab 5: back here in 15 min!