Definition detection (v0.2)


Title / arXiv ID / year / ACL ID: Supervised Domain Enablement Attention for Personalized Domain Classification / 1812.07546 / 2018 / D18-1106
Keyphrases: intelligent personal digital assistants; domain classification; natural language understanding; deep learning

Introduction

Due to recent advances in deep learning techniques, intelligent personal digital assistants (IPDAs) such as Amazon Alexa, Google Assistant, Microsoft Cortana, and Apple Siri have been widely used as real-life applications of natural language understanding BIBREF0, BIBREF1.


In natural language understanding, domain classification is the task of finding the most relevant domain given an input utterance BIBREF2. For example, "make a lion sound" and "find me an apple pie recipe" should be classified as ZooKeeper and AllRecipe, respectively. Recent IPDAs cover more than several thousand diverse domains by including third-party developed domains such as Alexa Skills BIBREF3, BIBREF4, BIBREF5, Google Actions, and Cortana Skills, which makes domain classification a more challenging task.

Given such a large number of domains, leveraging the user's enabled domains has been shown to improve domain classification performance, since enabled domains reflect the user's context in terms of domain usage BIBREF6. For an input utterance, BIBREF6 use an attention mechanism so that a weighted sum of the enabled domain vectors is used as an input signal along with the utterance vector. The enabled domain vectors and the attention weights are trained automatically in an end-to-end fashion to be helpful for domain classification.

In this paper, we propose a supervised enablement attention mechanism for more effective attention on the enabled domains. First, we use the logistic sigmoid instead of softmax as the attention activation function, relaxing the constraint that the attention weights over all enabled domains sum to 1 to the constraint that each attention weight lies between 0 and 1 regardless of the other weights BIBREF7, BIBREF8. Therefore, all the attention weights can be very low if no enabled domain is relevant to the ground truth, so that we can disregard the irrelevant enabled domains, and multiple attention weights can have high values when multiple enabled domains are helpful for disambiguating an input utterance. Second, we encourage each attention weight to be high if the corresponding enabled domain is the ground-truth domain and low otherwise, using a supervised attention method BIBREF9, so that the attention weights can be tuned directly for the downstream classification task. Third, we apply self-distillation BIBREF10 on top of the enablement attention weights so that we can better utilize the enabled domains that are not ground-truth domains but are still relevant.

Evaluating on datasets obtained from real usage in a large-scale IPDA, we show that our approach significantly improves domain classification performance by utilizing the domain enablement information effectively.

Model

Figure FIGREF2 shows the overall architecture of the proposed model.

Given an input utterance, each word of the utterance is represented as a dense vector through word embedding followed by a bidirectional long short-term memory (BiLSTM) BIBREF11. Then, an utterance vector is composed by concatenating the last outputs of the forward LSTM and the backward LSTM.

To represent the domain enablement information, we compute a weighted sum of the domain enablement vectors, where the weights are calculated by the logistic sigmoid function on top of multiplicative attention BIBREF14 between the utterance vector and the domain enablement vectors. The attention weight of an enabled domain e is formulated as follows:

$a_e = \sigma(u \cdot v_e),$

where $u$ is the utterance vector, $v_e$ is the enablement vector of enabled domain e, and $\sigma$ is the sigmoid function. Compared to the conventional attention mechanism using the softmax function, which constrains the attention weights to sum to 1, sigmoid attention has more expressive power: each attention weight can lie between 0 and 1 regardless of the other weights. We show that sigmoid attention is indeed more effective for improving prediction performance in Section SECREF3.
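
A minimal sketch of this sigmoid multiplicative attention step, written in PyTorch; the tensor shapes and variable names are illustrative assumptions rather than the authors' implementation:

```python
import torch

def enablement_attention(u, V):
    """Sigmoid multiplicative attention over enabled domains.

    u: utterance vector, shape (d,), e.g. the concatenated BiLSTM outputs
    V: enablement vectors, shape (k, d), one row per enabled domain
    Returns the attention weights a (k,) and the weighted sum of V (d,).
    """
    scores = V @ u                # multiplicative attention: u . v_e for each enabled domain e
    a = torch.sigmoid(scores)     # each weight lies in (0, 1) independently of the others
    weighted_sum = a @ V          # sum_e a_e * v_e
    return a, weighted_sum

# Example with 3 enabled domains and 256-dimensional vectors
u = torch.randn(256)
V = torch.randn(3, 256)
a, enablement_vec = enablement_attention(u, V)
```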

The utterance vector and the weighted sum of the domain enablement vectors are concatenated to represent the utterance and the domain enablement as a single vector. Given the concatenated vector, a feed-forward neural network with a single hidden layer predicts a confidence score for each domain through a logistic sigmoid function.

One issue with the proposed architecture is that the domain enablement can be trained to be a very strong signal, so that in many cases one of the enabled domains is predicted regardless of the utterance's relevance to it. To reduce this prediction bias, during training we replace the correct enabled domains of an input utterance with randomly sampled enabled domains with 50% probability, so that the domain enablement is used as an auxiliary signal rather than a determining signal. During inference, we always use the correct domain enablements of the given utterances.
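
A small illustrative helper for the 50% random replacement of enabled domains during training (the function name and data layout are hypothetical):

```python
import random

def training_enablement(true_enabled, all_domains, p=0.5):
    """With probability p, replace the utterance's true enabled domains with a random
    sample of the same size, so enablement remains an auxiliary rather than a
    determining signal. At inference time the true enabled domains are always used."""
    if random.random() < p:
        return random.sample(all_domains, k=len(true_enabled))
    return list(true_enabled)
```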

The main loss function of our model is formulated as a binary log loss between the confidence scores and the ground-truth vector as follows:

$\mathcal{L}_m = -\sum_{i=1}^{n} \left[ y_i \log o_i + (1 - y_i) \log (1 - o_i) \right],$

where n is the number of all domains, o is an n-dimensional confidence score vector from the model, and y is an n-dimensional one-hot vector whose element at the position of the ground-truth domain is set to 1.

Supervised Enablement Attention

Attention weights are originally intended to be trained automatically in an end-to-end fashion BIBREF16, but it has been shown that applying proper explicit supervision to the attention improves downstream tasks such as machine translation given word alignments and constituent parsing given annotations between surface words and nonterminals BIBREF9, BIBREF17, BIBREF18.

We hypothesize that if the ground-truth domain is one of the enabled domains, the attention weight for the ground-truth domain should be high, and vice versa. To apply this hypothesis in the model training as a supervised attention method, we formulate an auxiliary loss function as follows:

$\mathcal{L}_a = -\sum_{e \in E} \left[ y_e \log a_e + (1 - y_e) \log (1 - a_e) \right],$

where E is the set of enabled domains, $y_e$ indicates whether enabled domain e is the ground-truth domain, and $a_e$ is the attention weight for the enabled domain e.

Self-Distilled Attention

One issue with the supervised attention in Section SECREF5 is that enabled domains that are not ground-truth domains are encouraged to have lower attention weights regardless of their relevance to the input utterances and the ground-truth domains. Distillation methods utilize not only the ground truth but also all the output activations of a source model, so that all the prediction information from the source model can be utilized for more effective knowledge transfer between the source model and the target model BIBREF19. Self-distillation, which trains a model leveraging the outputs of a source model with the same architecture or capacity, has been shown to improve the target model's performance BIBREF10.

We use a variant of self-distillation where the model outputs at the previous epoch with the best dev set performance are used as the soft targets for distillation, so that the enabled domains that are not ground truths can also be used for the supervised attention. While conventional distillation methods utilize softmax activations as the target values, we show that distillation on top of sigmoid activations is also effective. The loss function for the self-distillation on the attention weights is formulated as follows:

$\mathcal{L}_d = -\sum_{e \in E} \left[ \tilde{a}_e \log a_e + (1 - \tilde{a}_e) \log (1 - a_e) \right],$

where $\tilde{a}_e$ is the attention weight of the model showing the best dev set performance in the previous epochs. It is formulated as:

$\tilde{a}_e = \sigma\!\left(\frac{u \cdot v_e}{T}\right),$

where T is the temperature allowing sufficient usage of all the attention weights as soft targets. In this work, we set T to 16, which shows the best dev set performance.

We also evaluated soft-target regularization BIBREF21, where a weighted sum of the hard ground-truth target vector and the soft target vector is used as a single target vector, but it did not show better performance than self-distillation.

All the described loss functions are added to compose a single loss function as follows:

$\mathcal{L} = \mathcal{L}_m + \alpha \left[ (1 - \beta_t)\,\mathcal{L}_a + \beta_t\,\mathcal{L}_d \right],$

where α is a coefficient representing the degree of supervised enablement attention and $\beta_t$ denotes the degree of the self-distillation. We set α to 0.01 in this work. Following BIBREF22, $\beta_t = 1 - 0.95^t$, where t denotes the current training epoch starting from 0, so that the hard ground-truth targets are more influential in the early epochs and the self-distillation is utilized more in the later epochs.
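
The following PyTorch sketch combines the three losses with the schedule above; it is a hedged illustration of the formulas (the paper sums over elements, while binary_cross_entropy averages by default), and how o, a, and the previous-epoch logits are produced is assumed to be handled elsewhere in the model:

```python
import torch
import torch.nn.functional as F

def combined_loss(o, y, a, y_enabled, prev_attention_logits, epoch,
                  alpha=0.01, temperature=16.0):
    """L = L_m + alpha * [(1 - beta_t) * L_a + beta_t * L_d] (sketch).

    o                    : (n,) confidence scores for all domains (float tensor)
    y                    : (n,) one-hot ground-truth domain vector (float tensor)
    a                    : (k,) sigmoid attention weights over enabled domains
    y_enabled            : (k,) 1.0 if the enabled domain is the ground truth, else 0.0
    prev_attention_logits: (k,) attention logits u . v_e from the best previous-epoch model
    epoch                : current training epoch t, starting from 0
    """
    beta_t = 1.0 - 0.95 ** epoch                                 # distillation weight schedule
    a_soft = torch.sigmoid(prev_attention_logits / temperature)  # temperature-softened soft targets

    loss_m = F.binary_cross_entropy(o, y)                        # main domain-classification loss
    loss_a = F.binary_cross_entropy(a, y_enabled)                # supervised enablement attention
    loss_d = F.binary_cross_entropy(a, a_soft)                   # self-distillation on attention
    return loss_m + alpha * ((1.0 - beta_t) * loss_a + beta_t * loss_d)
```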


Experiments

We evaluate our proposed model on domain classification leveraging enabled domains. The enabled domains can be a crucial disambiguating signal, especially when there are multiple similar domains. For example, assume that the input utterance is "what's the weather" and there are multiple weather-related domains such as NewYorkWeather, AccuWeather, and WeatherChannel. In this case, if WeatherChannel is included as an enabled domain of the current user, it is likely to be the most relevant domain to the user.

Datasets

Following the data collection methods used in BIBREF6, our models are trained using utterances with explicit invocation patterns. For example, given a user's utterance "Ask {ZooKeeper} to {play peacock sound}," "play peacock sound" and ZooKeeper are extracted to compose a pair of the utterance and the ground truth, respectively. In this way, we have generated train, development, and test sets containing 4.4M, 500K, and 500K utterances, respectively. All the utterances come from the usage log of Amazon Alexa, and the ground truth of each utterance is one of 1K frequently used domains. The average number of enabled domains per utterance in the test sets is 8.47.

One issue with this collected dataset is that the ground truth is included in the enabled domains for more than 90% of the utterances, i.e., the ground truths are biased toward the enabled domains. For a more correct and unbiased evaluation of the models on input utterances from real live traffic, we also evaluate the models on same-sized train, development, and test sets where the utterances are sampled so that the ratio of ground-truth inclusion in the enabled domains is 70%, which is closer to the ratio for actual input traffic.

Results

Table TABREF8 shows the accuracies of our proposed models on the two test sets. We also show mean reciprocal rank (MRR) and top-3 accuracy, which are meaningful when utilizing a post-reranker, but we do not cover reranking issues in this paper BIBREF23, BIBREF4.

From Table TABREF8, we can first see that changing softmax attention to sigmoid attention significantly improves the performance. This means that giving the domain enablement information more expressive power by relaxing the softmax constraint is effective for leveraging the domain enablement information in domain classification. Along with sigmoid attention, supervised attention leveraging the ground truth slightly improves the performance, and supervised attention combined with self-distillation shows significant performance improvement. This demonstrates that supervised domain enablement attention leveraging ground-truth enabled domains is helpful, and that utilizing attention information from the other enabled domains is synergistic.

BIBREF6's model also adds a domain enablement bias vector to the final output, which is helpful when the ground-truth domain is one of the enabled domains. Models (5) and (6), which use such a bias vector, also show good performance on the test set where the ground truth is one of the enabled domains with more than 90% probability. However, on the unbiased test set, where the ground truth is included in the enabled domains with a smaller probability, not adding the bias vector is better overall.

Table TABREF9 shows sample utterances correctly predicted by model (4) but not by models (1) and (2). For the first two utterances, the ground truths are included in the enabled domains, but there were only hundreds or fewer training instances whose ground truths are CryptoPrice or Expedia. In these cases, model (1) attends to unrelated domains and model (2) attends to none of the enabled domains, but model (4), which uses supervised attention, attends to the ground truth even without many training examples. "find my phone" has a single enabled domain, which is not the ground truth. In this case, model (1) still fully attends to the unrelated domain because of softmax attention, while models (2) and (4) do not attend to it highly, so the unrelated enabled domain has little impact.

Implementation Details

The word vectors are initialized with off-the-shelf GloVe vectors BIBREF24, and all the other model parameters are initialized with Xavier initialization BIBREF25. Each model is trained for 25 epochs, and the parameters showing the best performance on the development set are chosen as the model parameters. We use ADAM BIBREF26 for optimization with an initial learning rate of 0.0002 and a mini-batch size of 128. We use gradient clipping with the threshold set to 5. We use a variant of LSTM where the input gate and the forget gate are coupled and peephole connections are used BIBREF27, BIBREF28. We also use variational dropout for LSTM regularization BIBREF29. All the models are implemented with DyNet BIBREF30.

Conclusion

We have introduced a novel domain enablement attention mechanism that improves domain classification performance by utilizing domain enablement information more effectively. The proposed attention mechanism uses sigmoid attention for more expressive power of the attention weights, supervised attention leveraging ground-truth information for explicit guidance of the attention weight training, and self-distillation for attention supervision leveraging enabled domains that are not ground-truth domains. Evaluating on utterances from real usage in a large-scale IPDA, we have demonstrated that our proposed model significantly improves domain classification performance by better utilizing domain enablement information.

Glossary:
intelligent personal digital assistants: real-life applications of natural language understanding (BIBREF0)
domain classification: a task that finds the most relevant domain given an input utterance (BIBREF2)
α: a coefficient representing the degree of supervised enablement attention
β_t: the degree of the self-distillation
y: an n-dimensional one-hot vector whose element at the position of the ground-truth domain is set to 1
self-distillation: training a model leveraging the outputs of a source model with the same architecture or capacity
t: the current training epoch
Title / arXiv ID / year / ACL ID: Posterior-regularized REINFORCE for Instance Selection in Distant Supervision / 1904.08051 / 2019 / N19-1290
Keyphrases: natural language processing; bag; entity pair

Introduction

Relation extraction is a fundamental task in natural language processing. By detecting and classifying the relation between entity pairs in unstructured documents, it can support many other tasks such as question answering.

Because relation extraction requires a large amount of labeled data, which makes supervised methods labor intensive, BIBREF0 propose distant supervision (DS), a widely used automatic annotation approach. In distant supervision, a knowledge base (KB) such as Freebase is aligned with natural-language documents, and the sentences that contain an entity pair from the KB are all assumed to express the exact relation that the entity pair has in the KB. We call the set of instances that contain the same entity pair a bag. In this way, the training instances can be divided into N bags $\mathbf{B} = \{B_1, B_2, \ldots, B_N\}$. Each bag $B_k$ corresponds to a unique entity pair $E_k = (e_1^k, e_2^k)$ and contains a sequence of instances $\{x_1^k, x_2^k, \ldots, x_{|B_k|}^k\}$. However, distant supervision may suffer from the wrong label problem: the instances in one bag may not actually express the relation.

To resolve the wrong label problem, as Fig. 2 shows, BIBREF1 model the instance selection task in one bag $B_k$ as a sequential decision process and use the REINFORCE algorithm BIBREF2 to train an agent $\pi(a|s,\theta_\pi)$, denoting the probability $P_\pi(A_t = a \mid S_t = s, \theta_t = \theta_\pi)$ that action a is taken at time t given that the agent is in state s with parameter vector $\theta_\pi$. The action a can only be 0 or 1, indicating whether an instance $x_i^k$ truly expresses the relation and whether it should be selected and added to the new bag $\bar{B}_k$. The state s is determined by the entity pair corresponding to the bag, the candidate instance to be selected, and the instances that have already been selected. Accomplishing this task, the agent obtains a new bag $\bar{B}_k$ at the terminal of the trajectory with fewer wrongly labeled instances. With the newly constructed dataset $\bar{\mathbf{B}} = \{\bar{B}_1, \bar{B}_2, \ldots, \bar{B}_N\}$ containing fewer wrongly labeled instances, we can train bag-level relation prediction models with better performance. Meanwhile, the relation prediction model gives the reward to the instance selection agent. Therefore, the agent and the relation classifier can be trained jointly.

However, REINFORCE is a Monte Carlo algorithm and needs stochastic gradient methods for optimization. It is unbiased and has good convergence properties, but it may also have high variance and be slow to train BIBREF2.

Therefore, we train a REINFORCE-based agent by integrating some domain-specific rules to accelerate the training process and guide the agent to explore more effectively and learn a better policy. Here we use rule patterns as Fig. 1 shows BIBREF3. The instances that return true (match the pattern and label in any one of the rules) are denoted as $x_{MI}$, and we adopt the posterior regularization method BIBREF4 to regularize the posterior distribution of $\pi(a|s,\theta_\pi)$ on $x_{MI}$. In this way, we can construct a rule-based agent $\pi_r$. $\pi_r$ tends to regard the instances in $x_{MI}$ as valuable and select them without wasting time in trial-and-error exploration. There are 134 such rules altogether, and they match nearly four percent of the instances in the training data.

Our contributions include:

Related Work

Among the previous studies in relation extraction, most are supervised methods that need a large amount of annotated data BIBREF5. Distant supervision was proposed to alleviate this problem by aligning plain text with Freebase. However, distant supervision inevitably suffers from the wrong label problem.

Some previous research has addressed noisy data in distant supervision. An expressed-at-least-once assumption is employed in BIBREF0: if two entities participate in a relation, at least one instance in the bag might express that relation. Many follow-up studies adopt this assumption and choose the most credible instance to represent the bag. BIBREF6, BIBREF7 employ the attention mechanism to put a different attention weight on each sentence in one bag, assuming each sentence is related to the relation but with a different correlation.


Another key issue for relation extraction is how to model the instances and extract features. BIBREF8, BIBREF9, BIBREF10 adopt deep neural networks including CNNs and RNNs; these methods perform better than conventional feature-based methods.

Reinforcement learning has been widely used in data selection and natural language processing. BIBREF1 adopt REINFORCE for instance selection in distant supervision, which is the basis of our work.

Posterior regularization BIBREF4 is a framework for handling the problem that a variety of tasks and domains require the creation of large problem-specific annotated datasets. The framework incorporates external problem-specific information and puts a constraint on the posterior of the model. In this paper, we propose a rule-based REINFORCE based on this framework.

Methodology

In this section, we focus on the model details. Besides the interaction between the relation classifier and the instance selector, we introduce how we model the state, action, and reward of the agent and how we add rules for the agent during the training process.

Basic Relation Classifier

We need a pretrained basic relation classifier to define the reward and state. In this paper, we adopt the BGRU-with-attention bag-level relation classifier $f_b$ BIBREF10. With o denoting the output of $f_b$, corresponding to the scores associated with each relation, the conditional probability can be written as follows:

$P(r \mid B_k;\theta_b) = \frac{\exp(o_r)}{\sum_{i=1}^{n_r} \exp(o_i)},$

where r is the relation type, $n_r$ is the number of relation types, $\theta_b$ is the parameter vector of the basic relation classifier $f_b$, and $B_k$ denotes the input bag of the classifier.

In the basic classifier, the sentence representation is calculated by the sentence encoder network BGRU: the BGRU takes an instance $x_i^k$ as input and outputs the sentence representation $\mathrm{BGRU}(x_i^k)$. The sentence-level attention (ATT) then takes $\{\mathrm{BGRU}(x_1^k), \mathrm{BGRU}(x_2^k), \ldots, \mathrm{BGRU}(x_{|B_k|}^k)\}$ as input and outputs o, the final output of $f_b$ corresponding to the scores associated with each relation.
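
For concreteness, a compact PyTorch sketch of a bag-level classifier in the spirit of $f_b$ (a BGRU sentence encoder followed by sentence-level attention); the dimensions and the simple attention scoring used here are illustrative assumptions, not the exact model of BIBREF10:

```python
import torch
import torch.nn as nn

class BagRelationClassifier(nn.Module):
    """BGRU sentence encoder + sentence-level attention over a bag -> relation scores o."""

    def __init__(self, emb_dim=100, hidden=115, n_relations=53):
        super().__init__()
        self.encoder = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each sentence representation
        self.out = nn.Linear(2 * hidden, n_relations)  # relation scores o

    def encode(self, sent):                 # sent: (seq_len, emb_dim) word embeddings
        _, h = self.encoder(sent.unsqueeze(0))
        return torch.cat([h[0, 0], h[1, 0]])           # concatenate the two GRU directions

    def forward(self, bag):                 # bag: list of (seq_len, emb_dim) tensors
        reps = torch.stack([self.encode(s) for s in bag])       # (|B_k|, 2*hidden)
        weights = torch.softmax(self.attn(reps).squeeze(-1), dim=0)
        bag_rep = weights @ reps                                 # attention-weighted bag vector
        return torch.log_softmax(self.out(bag_rep), dim=-1)      # log P(r | B_k; theta_b)
```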

Original REINFORCE

The original REINFORCE agent training process is quite similar to BIBREF1. The instance selection process for one bag is completed in one trajectory, and the agent $\pi(a|s,\theta_\pi)$ is trained as an instance selector.

The key of the model is how to represent the state at every step and the reward at the terminal of the trajectory. We use the pretrained $f_b$ to address this key problem. The reward defined by the basic relation classifier is as follows:

$R = \log P_{f_b}(r_k \mid \bar{B}_k;\theta_b),$

where $r_k$ denotes the relation corresponding to $B_k$.

The state s mainly contains three parts: the representation of the candidate instance, the representation of the relation, and the representation of the instances that have already been selected.

The representation of the candidate instance is also defined by the basic relation classifier $f_b$. At time step t, we use $\mathrm{BGRU}(x_t^k)$ to represent the candidate instance $x_t^k$, and the same holds for the selected instances. As for the embedding of the relation, we use the entity embedding method introduced in the TransE model BIBREF11, trained on the Freebase triples mentioned in the training and testing datasets; the relation embedding $re_k$ is computed as the element-wise difference of the entity embeddings.


The policy π with parameter $\theta_\pi = \{W, b\}$ is defined as a logistic function of a linear transformation of the state:

$\pi(A_t = 1 \mid s;\theta_\pi) = \sigma(W s + b).$

With the model above, the parameter vector can be updated according to the REINFORCE algorithm BIBREF2.
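
A sketch of this policy and of the REINFORCE update with the delayed terminal reward, in PyTorch; the σ(Ws + b) parameterization is our reading of $\theta_\pi = \{W, b\}$, and the state construction is assumed to happen elsewhere:

```python
import torch

class SelectionPolicy(torch.nn.Module):
    """pi(A_t = 1 | s) = sigmoid(W s + b): probability of selecting the candidate instance."""

    def __init__(self, state_dim):
        super().__init__()
        self.linear = torch.nn.Linear(state_dim, 1)

    def forward(self, state):                          # state: (state_dim,)
        return torch.sigmoid(self.linear(state)).squeeze(-1)

def reinforce_update(policy, optimizer, states, actions, reward):
    """One bag = one trajectory; every step receives the same delayed terminal reward R.
    actions: 0/1 tensor of sampled actions; reward: scalar log P(r_k | selected bag)."""
    p_select = torch.stack([policy(s) for s in states])
    log_probs = torch.where(actions == 1, torch.log(p_select), torch.log(1 - p_select))
    loss = -(reward * log_probs).sum()                 # gradient ascent on sum_i R_i * grad log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```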

Posterior Regularized REINFORCE

REINFORCE uses the complete return, which includes all future rewards up until the end of the trajectory; in this sense, all updates are made after the trajectory is completed BIBREF2. These stochastic properties can make the training slow. Fortunately, we have some domain-specific rules that can help train the agent, and we adopt the posterior regularization framework to integrate these rules. The goal of this framework is to restrict the posterior of π: it guides the agent toward desired behavior instead of wasting too much time on meaningless exploration.

Since we assume that the domain-specific rules have high credibility, we design a rule-based policy agent $\pi_r$ to emphasize their influence on π. The posterior constraint for π is that the policy posterior for $x_{MI}$ is expected to be 1, which indicates that the agent should select the instances in $x_{MI}$. This expectation can be written as follows:

$\mathbb{E}_{\pi}\left[\mathbf{l}(A_t = 1)\right] = 1,$

where $\mathbf{l}$ here is the indicator function. In order to transfer the rules into a new policy $\pi_r$, the KL divergence between the posterior of π and $\pi_r$ should be minimized; this can be formally defined as

$\min_{\pi_r}\; \mathrm{KL}\!\left(\pi_r(a|s,\theta_\pi)\,\|\,\pi(a|s,\theta_\pi)\right) \quad \text{subject to the constraint in Eq. (4).}$

Optimizing the constrained convex problem defined by Eq. (4) and Eq. (5), we obtain a new policy $\pi_r$:

$\pi_r(A_t \mid s,\theta_\pi) = \frac{\pi(A_t \mid s,\theta_\pi)\,\exp\!\left(\mathbf{l}(A_t = 1) - 1\right)}{Z},$

where Z is a normalization term: $Z = \sum_{A_t \in \{0,1\}} \pi(A_t \mid s,\theta_\pi)\,\exp\!\left(\mathbf{l}(A_t = 1) - 1\right).$
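
A small numeric sketch of Eq. (6): for an instance in $x_{MI}$, the select probability is re-weighted by $\exp(\mathbf{l}(A_t = 1) - 1)$ and re-normalized, while instances outside $x_{MI}$ keep the original policy (function names are illustrative):

```python
import math

def rule_based_policy(p_select):
    """pi_r(A_t = 1 | s) for an instance in x_MI. Because l(A_t = 1) is 1 for the
    "select" action and 0 for "skip", exp(l(A_t = 1) - 1) keeps the select term
    and discounts the skip term by e^{-1}."""
    select_term = p_select * math.exp(1.0 - 1.0)          # A_t = 1
    skip_term = (1.0 - p_select) * math.exp(0.0 - 1.0)    # A_t = 0
    z = select_term + skip_term                            # normalization term Z
    return select_term / z

def step_policy(p_select, matches_rule):
    """Use pi_r for instances matching a rule (x_MI); otherwise keep the original pi."""
    return rule_based_policy(p_select) if matches_rule else p_select
```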

Algorithm 1 formally defines the overall framework of the rule-based data selection process.

Algorithm 1: PR REINFORCE
Input: original DS dataset $\mathbf{B} = \{B_1, B_2, \ldots, B_N\}$; max episode M; basic relation classifier $f_b$; step size α
Output: an instance selector
Initialize the policy weights $\theta_\pi' = \theta_\pi$ and the classifier weights $\theta_b' = \theta_b$.
For episode m = 1 to M:
    For each $B_k = \{x_1^k, x_2^k, \ldots, x_{|B_k|}^k\}$ in $\mathbf{B}$, with $\bar{B}_k = \{\}$:
        For step i in $1, \ldots, |B_k|$:
            Construct $s_i$ from $\bar{B}_k$, $x_i^k$, and $re_k$.
            If $x_i^k \in x_{MI}$: construct $\pi_r$ and sample action $A_i$ from $\pi_r(a|s_i,\theta_\pi')$; otherwise sample $A_i$ from $\pi(a|s_i,\theta_\pi')$.
            If $A_i = 1$: add $x_i^k$ to $\bar{B}_k$.
        Get the terminal reward $R = \log P_{f_b}(r_k|\bar{B}_k,\theta_b')$ and set the step-delayed rewards $R_i = R$.
        Update the agent: $\theta_\pi \leftarrow \theta_\pi + \alpha \sum_{i=1}^{|B_k|} R_i \nabla_{\theta_\pi} \log\pi$; then $\theta_\pi' = \tau\theta_\pi + (1-\tau)\theta_\pi'$.
    Update the classifier $f_b$.

Experiment

Our experiment is designed to demonstrate that our proposed methodology can train an instance selector more efficiently.

We tuned our model using three-fold cross validation on the training set. For the parameters of the instance selector, we set the dimension of the entity embeddings to 50 and the learning rate to 0.01. The delay coefficient τ is 0.005. For the parameters of the relation classifier, we follow the settings described in BIBREF10.

We compare three settings: the rule-based reinforcement learning method, the original reinforcement learning method, and a method with no reinforcement learning, i.e., the basic relation classifier trained on the original DS dataset. We use the last as the baseline.

Dataset

A widely used DS dataset, developed by BIBREF12, is used as the original dataset to be selected from. The dataset is generated by aligning Freebase with the New York Times corpus.

Metric and Performance Comparison

We compare the data selection models by the final performance of the basic model trained on the newly constructed datasets selected by the different models. We use precision/recall curves as the main metric. Fig. 3 presents this comparison. PR REINFORCE constructs a cleaned DS dataset with less noisy data than the original REINFORCE, so the BGRU+2ATT classifier can reach better performance.

Conclusions

In this paper, we develop a posterior regularized REINFORCE methodology to alleviate the wrong label problem in distant supervision. Our model makes full use of hand-crafted domain-specific rules in the trial-and-error search during the training process of the REINFORCE method for DS dataset selection. The experimental results show that PR REINFORCE outperforms the original REINFORCE. Moreover, PR REINFORCE greatly improves the efficiency of REINFORCE training.

Acknowledgments

This work has been supported in part by NSFC (No. 61751209, U1611461), 973 program (No. 2015CB352302), Hikvision-Zhejiang University Joint Research Center, Chinese Knowledge Center of Engineering Science and Technology (CKCEST), Engineering Research Center of Digital Library, Ministry of Education. Xiang Ren's research has been supported in part by National Science Foundation SMA 18-29268.

Glossary:
agent: obtains a new bag $\bar{B}_k$ at the terminal of the trajectory with fewer wrongly labeled instances
REINFORCE: a Monte Carlo algorithm
BIBREF6, BIBREF7: employ the attention mechanism to put a different attention weight on each sentence in one bag, assuming each sentence is related to the relation but with a different correlation
reinforcement learning: widely used in data selection and natural language processing
r: the relation type; n_r: the number of relation types
θ_b: the parameter vector of the basic relation classifier f_b
B_k: the input bag of the classifier
sentence representation: calculated by the sentence encoder network BGRU, which takes an instance as input
key of the model: how to represent the state at every step and the reward at the terminal of the trajectory
entity embedding method: trained on the Freebase triples mentioned in the training and testing datasets
relation embedding re_k: computed as the element-wise difference of the entity embeddings
l: the indicator function
Z: a normalization term
precision/recall curves: the main metric
Title / arXiv ID / year / ACL ID: Learning Patient Representations from Text / 1805.02096 / 2018 / S18-2014
Keyphrases: machine learning; topic modeling; medical informatics; electronic health records

Introduction

Mining electronic health records for patients who satisfy a set of predefined criteria is known in medical informatics as phenotyping. Phenotyping has numerous applications such as outcome prediction, clinical trial recruitment, and retrospective studies. Supervised machine learning is currently the predominant approach to automatic phenotyping, and it typically relies on sparse patient representations such as bag-of-words and bag-of-concepts BIBREF0. We consider an alternative that involves learning patient representations. Our goal is to develop a conceptually simple method for learning lower-dimensional dense patient representations that succinctly capture the information about a patient and are suitable for downstream machine learning tasks. Our method uses cheap supervision in the form of billing codes and thus has the representational power of a large dataset. The learned representations can be used to train phenotyping classifiers with much smaller datasets.

Recent trends in machine learning have used neural networks for representation learning, and these ideas have propagated into the clinical informatics literature, using information from electronic health records to learn dense patient representations BIBREF1, BIBREF2, BIBREF3, BIBREF4, BIBREF5, BIBREF6. Most of this work to date has used only codified variables, including ICD (International Classification of Diseases) codes, procedure codes, and medication orders, often reduced to smaller subsets. Recurrent neural networks are commonly used to represent temporality BIBREF1, BIBREF2, BIBREF3, BIBREF6, and many methods map from code vocabularies to dense "embedding" input spaces BIBREF1, BIBREF2, BIBREF5, BIBREF6.


One of the few patient representation learning systems to incorporate electronic medical record (EMR) text is DeepPatient BIBREF4. This system takes as input a variety of features, including coded diagnoses as in the systems above, but it also uses topic modeling on the text to derive topic features, and it applies a tool that maps text spans to clinical concepts in standard vocabularies (SNOMED and RxNorm). To learn the representations, they use a model consisting of stacked denoising autoencoders. In an autoencoder network, the goal of training is to reconstruct the input using hidden layers that compress the size of the input. The output layer and the input layer therefore have the same size, and the loss function calculates reconstruction error. The hidden layers thus form the patient representation. This method is used to predict novel ICD codes (from a reduced set of 78 elements) occurring in the next 30, 60, 90, and 180 days.

Our work extends these methods by building a neural network system for learning patient representations using text variables only. We train this model to predict billing codes, but solely as a means to learning representations. We show that the representations learned for this task are general enough to obtain state-of-the-art performance on a standard comorbidity detection task. Our work can also be viewed as an instance of transfer learning BIBREF7: we store the knowledge gained from a source task (billing code prediction) and apply it to a different but related target task.

Patient Representation Learning

The objective of patient representation learning is to map the raw text of patient notes to a dense vector that can subsequently be used for various patient-level predictive analytics tasks such as phenotyping, outcome prediction, and cluster analysis. The process of learning patient representations involves two phases: (1) supervised training of a neural network model on a source task that has abundant labeled data linking patients with some outcomes; (2) patient vector derivation for a target task, performed by presenting new patient data to the network and harvesting the resulting representations from one of the hidden layers.

In this work, we utilize billing codes as a source of supervision for learning patient vectors in phase 1. Billing codes, such as ICD9 diagnostic codes, ICD9 procedure codes, and CPT codes, are derived manually by medical coders from patient records for the purpose of billing. Billing codes are typically available in abundance in a healthcare institution and present a cheap source of supervision. Our hypothesis is that a patient vector useful for predicting billing codes will capture key characteristics of a patient, making this vector suitable for patient-level analysis.

For learning dense patient vectors, we propose a neural network model that takes as input a set of UMLS concept unique identifiers (CUIs) derived from the text of a patient's notes and jointly predicts all billing codes associated with the patient. CUIs are extracted from notes by mapping spans of clinically relevant text (e.g. shortness of breath, appendectomy, MRI) to entries in the UMLS Metathesaurus. CUIs can be easily extracted by existing tools such as Apache cTAKES (http://ctakes.apache.org). Our neural network model (Figure 1) is inspired by the Deep Averaging Network (DAN) BIBREF8, FastText BIBREF9, and continuous bag-of-words (CBOW) BIBREF10, BIBREF11 models.


Model Architecture: The model takes as input a set of CUIs. CUIs are mapped to 300-dimensional concept embeddings, which are averaged and passed on to a 1000-dimensional hidden layer, creating a vectorial representation of a patient. The final network layer consists of n sigmoid units that are used for joint billing code prediction. The output of each sigmoid unit is converted to a binary (1/0) outcome. The number of units n in the output layer is equal to the number of unique codes being predicted. The model is trained using a binary cross-entropy loss function and the RMSProp optimizer. Our model is capable of jointly predicting multiple billing codes for a patient, placing it in the family of supervised multi-label classification methods. In our preliminary work, we experimented with CNN and RNN-based architectures, but their performance was inferior to the model described here in terms of both accuracy and speed.

Once the model achieves an acceptable level of performance, we can compute a vector representing a new patient by freezing the network weights, pushing the new patient's CUIs through the network, and harvesting the computed values of the nodes in the hidden layer. The resulting 1000-dimensional vectors can be used for a variety of machine learning tasks.
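
A minimal PyTorch sketch of this architecture and of harvesting the hidden layer as the patient vector; the layer sizes come from the text, while the ReLU activation and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatientRepresentationModel(nn.Module):
    """Average CUI embeddings -> 1000-d hidden layer (the patient vector) -> n sigmoid outputs."""

    def __init__(self, n_cuis, n_codes, emb_dim=300, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(n_cuis, emb_dim)
        self.hidden = nn.Linear(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_codes)

    def patient_vector(self, cui_ids):       # cui_ids: LongTensor of CUI indices for one patient
        avg = self.embed(cui_ids).mean(dim=0)             # average the concept embeddings
        return torch.relu(self.hidden(avg))               # harvested as the dense patient vector

    def forward(self, cui_ids):
        # joint billing-code probabilities, thresholded to binary outcomes downstream
        return torch.sigmoid(self.out(self.patient_vector(cui_ids)))

# Training would use binary cross-entropy (torch.nn.BCELoss) against the multi-hot billing-code
# vector with the RMSprop optimizer (torch.optim.RMSprop); after training, patient_vector() is
# applied with frozen weights (inside torch.no_grad()) to new patients.
```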

Datasets

For training patient representations, we utilize the MIMIC III corpus BIBREF12. MIMIC III contains notes for over 40,000 critical care unit patients admitted to Beth Israel Deaconess Medical Center, as well as ICD9 diagnostic, procedure, and Current Procedural Terminology (CPT) codes. Since our goal is learning patient-level representations, we concatenate all available notes for each patient into a single document. We also combine all ICD9 and CPT codes for a patient to form the targets for the prediction task. Finally, we process the patient documents with cTAKES to extract UMLS CUIs. cTAKES is an open-source system for processing clinical texts with an efficient dictionary lookup component for identifying CUIs, making it possible to process a large number of patient documents.

To decrease training time, we reduce the complexity of the prediction task as follows: (1) we collapse all ICD9 and CPT codes to their more general category (e.g. the first three digits for ICD9 diagnostic codes), (2) we drop all CUIs that appear fewer than 100 times, (3) we discard patients that have over 10,000 CUIs, and (4) we discard all billing codes that have fewer than 1,000 examples. This preprocessing results in a dataset consisting of 44,211 patients mapped to multiple codes (174 categories in total). We randomly split the patients into a training set (80%) and a validation set (20%) for tuning hyperparameters.
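
A sketch of this filtering under the assumption that the data are available as (CUI list, code list) pairs per patient; the helper name and data layout are hypothetical, and the steps mirror the four above:

```python
from collections import Counter

def preprocess(patients, min_cui_count=100, max_cuis_per_patient=10_000, min_code_count=1_000):
    """patients: list of (cui_list, code_list) pairs. Codes are first collapsed to a more
    general category (here, the first three characters, as for ICD9 diagnostic codes)."""
    patients = [(cuis, [c[:3] for c in codes]) for cuis, codes in patients]     # (1) collapse codes

    cui_counts = Counter(c for cuis, _ in patients for c in cuis)
    code_counts = Counter(c for _, codes in patients for c in codes)

    filtered = []
    for cuis, codes in patients:
        cuis = [c for c in cuis if cui_counts[c] >= min_cui_count]              # (2) drop rare CUIs
        if len(cuis) > max_cuis_per_patient:                                    # (3) drop very long records
            continue
        codes = [c for c in codes if code_counts[c] >= min_code_count]          # (4) drop rare codes
        filtered.append((cuis, codes))
    return filtered
```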

For evaluating our patient representations, we use a publicly available dataset from the Informatics for Integrating Biology to the Bedside (i2b2) Obesity challenge BIBREF13. The Obesity challenge data consist of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository annotated with respect to obesity and its fifteen most common comorbidities, so each patient is labeled for sixteen different categories. We focus on the more challenging intuitive task BIBREF13, BIBREF14, containing three label types (present, absent, questionable), where annotators labeled a diagnosis as present if its presence could be inferred (i.e., even if not explicitly mentioned). This task involves complicated decision-making and inference.

Importantly, our patient representations are evaluated on sixteen different classification tasks, with patient data originating from a healthcare institution different from the one our representations were trained on. This setup is challenging, yet it presents a true test of the robustness of the learned representations.

Experiments

Our first baseline is an SVM classifier trained with bag-of-CUIs features. Our second baseline involves linear dimensionality reduction: we run singular value decomposition (SVD) on a patient-CUI matrix derived from the MIMIC corpus, reduce the space by selecting the 1000 largest singular values, and map the target instances into the resulting 1000-dimensional space.
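
The SVD baseline can be sketched with scikit-learn's TruncatedSVD (our choice of tool; the random matrices below are stand-ins for the real patient-by-CUI counts from MIMIC and the i2b2 target data):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in patient-by-CUI count matrices; in practice these come from MIMIC and i2b2 notes.
X_mimic = sparse_random(2000, 5000, density=0.01, random_state=0)
X_target = sparse_random(300, 5000, density=0.01, random_state=1)

svd = TruncatedSVD(n_components=1000, random_state=0)  # keep the 1000 largest singular values
svd.fit(X_mimic)
target_vectors = svd.transform(X_target)               # 1000-d dense vectors for the target patients
```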

Our multi-label billing code classifier is trained to maximize the macro F1 score for billing code prediction on the validation set. We train the model for 75 epochs with a learning rate of 0.001 and a batch size of 50. These hyperparameters are obtained by tuning the model's macro F1 on the validation set. Note that hyperparameter tuning occurred independently of the target task. Also note that since our goal is not to obtain the best possible performance on a held-out set, we do not allocate separate development and test sets. Once we determine the best values of these hyperparameters, we combine the training and validation sets and retrain the model. We train two versions of the model: (1) with randomly initialized CUI embeddings, and (2) with word2vec-pretrained CUI embeddings. Pre-trained embeddings are learned using word2vec BIBREF10 by extracting all CUIs from the text of MIMIC III notes and using the CBOW method with a window size of 5 and an embedding dimension of 300.
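
The CUI embedding pre-training can be sketched with gensim's word2vec implementation (gensim is our assumption; the paper only specifies CBOW, a window of 5, and 300 dimensions):

```python
from gensim.models import Word2Vec

# One list of CUI strings per MIMIC note or patient document (toy example identifiers).
cui_docs = [["C0013404", "C0003962", "C0020538"], ["C0011849", "C0027051", "C0013404"]]

# CBOW (sg=0), window size 5, 300-dimensional embeddings.
cui2vec = Word2Vec(sentences=cui_docs, vector_size=300, window=5, sg=0, min_count=1, epochs=10)
pretrained = {cui: cui2vec.wv[cui] for cui in cui2vec.wv.index_to_key}  # CUI -> 300-d vector
```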

We then create a 1000-dimensional vector representation for each patient in the i2b2 obesity challenge data by giving the sparse (CUI-based) representation for each patient as input to the ICD code classifier. Rather than reading the classifier's predictions, we harvest the hidden layer outputs, forming a 1000-dimensional dense vector. We then train multi-class SVM classifiers for each disease (using a one-vs-all strategy), building sixteen SVM classifiers. Following the i2b2 obesity challenge, the models are evaluated using macro precision, recall, and F1 scores BIBREF13.
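
A sketch of the per-disease evaluation with scikit-learn (LinearSVC with a one-vs-rest scheme stands in for the multi-class SVMs; inputs are assumed to be the harvested 1000-dimensional patient vectors and per-disease gold labels):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

def train_and_evaluate(train_vectors, train_labels, test_vectors, test_labels, diseases):
    """train_labels[d] / test_labels[d]: per-disease labels (present / absent / questionable)."""
    scores = {}
    for disease in diseases:                    # one multi-class SVM per disease
        clf = LinearSVC(multi_class="ovr")      # one-vs-all strategy
        clf.fit(train_vectors, train_labels[disease])
        pred = clf.predict(test_vectors)
        p, r, f1, _ = precision_recall_fscore_support(
            test_labels[disease], pred, average="macro", zero_division=0)
        scores[disease] = (p, r, f1)            # macro precision / recall / F1
    return scores
```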

We make the code available for use by the research community.

Results

Our billing code classifier achieves a macro F1 score on the source task (billing code prediction) of 0.447 when using randomly initialized CUI embeddings and a macro F1 of 0.473 when using pre-trained CUI embeddings. This is not directly comparable to existing work because the setup is unique, and we note that this is likely a difficult task because of the large output space. It is nevertheless interesting that pre-training the CUI embeddings has a positive relative impact on performance.

Classifier performance for the target phenotyping task is shown in Table TABREF6, which reports the performance of the baseline SVM classifier trained using the standard bag-of-CUIs approach (Sparse), the baseline using 1000-dimensional vectors obtained via dimensionality reduction (SVD), and our system using dense patient vectors derived from the source task. Since a separate SVM classifier was trained for each disease, we present classifier performance for each SVM model.

Both of our baseline approaches showed approximately the same performance (F1=0.675) as the best reported i2b2 system BIBREF15 (although that system used a rule-based approach). Our dense patient representations outperformed both baseline approaches by four percentage points on average (F1=0.715). The difference is statistically significant (t-test, p=0.03).

Out of the sixteen diseases, our dense representations performed worse than the sparse baseline (with one tie) for only three: gallstones, hypertriglyceridemia, and venous insufficiency. The likely cause is the scarcity of positive training examples; two of these diseases have the smallest numbers of positive training examples.

Discussion and Conclusion

For most diseases, and on average, our dense patient representations outperformed sparse patient representations. Importantly, the patient representations were learned from a task (billing code prediction) that is different from the evaluation task (comorbidity prediction), presenting evidence that useful representations can be derived in this transfer learning scenario.

Furthermore, the data from which the representations were learned (BI medical center) and the evaluation data (Partners HealthCare) originated from two different healthcare institutions, providing evidence of the robustness of our patient representations.

Our future work will include exploring the use of other sources of supervision for learning patient representations, alternative neural network architectures, tuning the learned patient representations to the target task, and evaluating the patient representations on other phenotyping tasks.

Acknowledgments

The Titan X GPU used for this research was donated by the NVIDIA Corporation. Timothy Miller's effort was supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health under award number R01GM114355. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Glossary:
recurrent neural networks: commonly used to represent temporality (BIBREF1, BIBREF2, BIBREF3, BIBREF6)
embedding input spaces: many methods map from code vocabularies to dense "embedding" input spaces (BIBREF1, BIBREF2, BIBREF5, BIBREF6)
billing codes (ICD9 diagnostic codes, ICD9 procedure codes, CPT codes): derived manually by medical coders from patient records for the purpose of billing
neural network model: inspired by the Deep Averaging Network, FastText, and continuous bag-of-words models
model architecture: the model takes as input a set of CUIs
MIMIC III: contains notes for over 40,000 critical care unit patients admitted to Beth Israel Deaconess Medical Center, as well as ICD9 diagnostic, procedure, and CPT codes
cTAKES: an open-source system for processing clinical texts with an efficient dictionary lookup component for identifying CUIs
annotators: labeled a diagnosis as present if its presence could be inferred
multi-class SVM classifiers: trained for each disease using a one-vs-all strategy
F1=0.675: baseline performance, approximately the same as the best reported i2b2 system
the difference: statistically significant (t-test, p=0.03)
Title / arXiv ID / year / ACL ID: SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity / 1608.00869 / 2016 / D16-1235
Keyphrases: single learning algorithm; word representation learning; verb semantics; human language acquisition; distributed representation learning

Introduction

Verbs are famously both complex and variable. They express the semantics of an event as well as the relational information among participants in that event, and they display a rich range of syntactic and semantic behaviour BIBREF0, BIBREF1, BIBREF2. Verbs play a key role at almost every level of linguistic analysis. Information related to their predicate-argument structure can benefit many NLP tasks (e.g. parsing, semantic role labelling, information extraction) and applications (e.g. machine translation, text mining), as well as research on human language acquisition and processing BIBREF3. Precise methods for representing and understanding verb semantics will undoubtedly be necessary for machines to interpret the meaning of sentences with accuracy similar to that of humans.

Numerous algorithms for acquiring word representations from text and/or more structured knowledge bases have been developed in recent years BIBREF4, BIBREF5, BIBREF6. These representations (or embeddings) typically contain powerful features that are applicable to many language applications BIBREF7, BIBREF8. Nevertheless, the predominant approaches to distributed representation learning apply a single learning algorithm and representational form for all words in a vocabulary. This is despite evidence that applying different learning algorithms to word types such as nouns, adjectives and verbs can significantly increase the ultimate usefulness of representations BIBREF9.

One factor behind the lack of more nuanced word representation learning methods is the scarcity of satisfactory ways to evaluate or analyse representations of particular word types. Resources such as MEN BIBREF10, Rare Words BIBREF11 and SimLex-999 BIBREF12 focus either on words from a single class or on small samples of different word types, with automatic approaches already reaching or surpassing the inter-annotator agreement ceiling. Consequently, for word classes such as verbs, whose semantics is critical for language understanding, it is practically impossible to achieve statistically robust analyses and comparisons between different representation learning architectures.

To overcome this barrier to verb semantics research, we introduce SimVerb-3500, an extensive intrinsic evaluation resource that is unprecedented in both size and coverage. SimVerb-3500 includes 827 verb types from the University of South Florida Free Association Norms (USF) BIBREF13, and at least 3 member verbs from each of the 101 top-level VerbNet classes BIBREF14. This coverage enables researchers to better understand the complex diversity of syntactic-semantic verb behaviours, and provides direct links to other established semantic resources such as WordNet BIBREF15 and PropBank BIBREF16. Moreover, the large standardised development and test sets in SimVerb-3500 allow for principled tuning of hyperparameters, a critical aspect of achieving strong performance with the latest representation learning architectures.

In s:rw, we discuss previous evaluation resources targeting verb similarity. We present the new SimVerb-3500 data set along with our design choices and the pair selection process in s:dataset, while the annotation process is detailed in s:annotation. In s:analysis, we report the performance of a diverse range of popular representation learning architectures, together with benchmark performance on existing evaluation sets. In s:evaluation, we show how SimVerb-3500 enables a variety of new linguistic analyses, which were previously impossible due to the lack of coverage and scale in existing resources.

Related Work

A natural way to evaluate representation quality is by judging the similarity of representations assigned to similar words. The most popular evaluation sets at present consist of word pairs with similarity ratings produced by human annotators. Nevertheless, we find that all available datasets of this kind are insufficient for judging verb similarity due to their small size or narrow coverage of verbs.

In particular, a number of word pair evaluation sets are prominent in the distributional semantics literature.

Representative examples include RG-65 BIBREF17 and WordSim-353 BIBREF18, BIBREF19, which are small (65 and 353 word pairs, respectively). Larger evaluation sets such as the Rare Words evaluation set BIBREF11 (2034 word pairs) and the evaluation sets from Silberer:2014acl are dominated by noun pairs, and the former also focuses on low-frequency phenomena. Therefore, these datasets do not provide a representative sample of verbs BIBREF12.

Two datasets that do focus on verb pairs to some extent are the data set of Baker:2014emnlp and SimLex-999 BIBREF12. These datasets, however, still contain a limited number of verb pairs (134 and 222, respectively), making them unrepresentative of the rich variety of verb semantic phenomena.

In this paper, we provide a remedy for this problem by presenting a more comprehensive and representative verb pair evaluation resource.

The SimVerb-3500 Data Set

In this section, we discuss the design principles behind SimVerb-3500. We first demonstrate that a new evaluation resource for verb similarity is a necessity. We then describe how the final verb pairs were selected with the goal of being representative, that is, of guaranteeing wide coverage of two standard semantic resources: USF and VerbNet.

Design Motivation

Hill:2015cl argue that comprehensive high-quality evaluation resources have to satisfy the following three criteria: (C1) Representative (the resource covers the full range of concepts occurring in natural language); (C2) Clearly defined (it clearly defines the annotated relation, e.g., similarity); (C3) Consistent and reliable (untrained native speakers must be able to quantify the target relation consistently, relying on simple instructions).

Building on the same annotation guidelines as SimLex-999, which explicitly target similarity, we ensure that criteria C2 and C3 are satisfied. However, even SimLex, the most extensive evaluation resource for verb similarity available at present, is still of limited size, spanning only 222 verb pairs and 170 distinct verb lemmas in total. Given that 39 of the 101 top-level VerbNet classes are not represented at all in SimLex, while 20 classes have only one member verb, one may conclude that criterion C1 is not at all satisfied by current resources.

There is another fundamental limitation of all current verb similarity evaluation resources: automatic approaches have reached or surpassed the inter-annotator agreement ceiling. For instance, while the average pairwise correlation between annotators on SL-222 is a Spearman's ρ correlation of 0.717, the best performing automatic system reaches ρ=0.727 BIBREF20. SimVerb-3500 does not inherit this anomaly (see Tab. TABREF23) and demonstrates that there still exists an evident gap between human and system performance.
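
Evaluation on a resource like this follows the usual intrinsic protocol: rank-correlate model similarities with the human ratings. A small sketch (cosine similarity over arbitrary word vectors, Spearman's ρ via SciPy; names and data layout are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_verb_similarity(pairs, gold_scores, vectors):
    """pairs: list of (verb1, verb2); gold_scores: human ratings; vectors: verb -> embedding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    model_scores = [cosine(vectors[v1], vectors[v2]) for v1, v2 in pairs]
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho   # compare against the inter-annotator agreement ceiling
```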

In order to satisfy C1-C3, the new SimVerb-3500 evaluation set contains similarity ratings for 3,500 verb pairs, covering 827 verb types in total and 3 member verbs for each top-level VerbNet class. The rating scale goes from 0 (not similar at all) to 10 (synonymous). We employed the SimLex-999 annotation guidelines. In particular, we instructed annotators to give low ratings to antonyms and to distinguish between similarity and relatedness. Pairs that are related but not similar (e.g., to snore / to snooze, to walk / to crawl) thus have a fairly low rating. Several example pairs are provided in Tab. TABREF7.

Choice of Verb Pairs and Coverage

To ensure wide coverage of a variety of syntactico-semantic phenomena (C1), the choice of verb pairs is steered by two standard semantic resources available online: (1) the USF norms data set BIBREF13, and (2) the VerbNet verb lexicon BIBREF21, BIBREF14.

The USF norms data set (further USF) is the largest database of free association collected for English. It was generated by presenting human subjects with one of 5,000 cue concepts and asking them to write the first word coming to mind that is associated with that concept. Each cue concept c was normed in this way by over 10 participants, resulting in a set of associates a for each cue, for a total of over 72,000 (c,a) pairs. For each such pair, the proportion of participants who produced associate a when presented with cue c can be used as a proxy for the strength of association between the two words.

The norming process guarantees that the two words in a pair have a degree of semantic association which correlates well with semantic relatedness and similarity. Sampling from the USF set ensures that both related but non-similar pairs (e.g., to run / to sweat) and similar pairs (e.g., to reply / to respond) are represented in the final list of pairs. Further, the rich annotations of the USF data (e.g., concreteness scores, association strength) can be directly combined with the SimVerb-3500 similarity scores to yield additional analyses and insight.

VerbNet (VN) is the largest online verb lexicon currently available for English. It is hierarchical, domain-independent, and broad-coverage. VN is organised into verb classes extending the classes from Levin:1993book through further refinement to achieve syntactic and semantic coherence among class members. According to the official VerbNet guidelines, "Verb Classes are numbered according to shared semantics and syntax, and classes which share a top-level number (9-109) have corresponding semantic relationships." For instance, all verbs from the top-level Class 9 are labelled "Verbs of Putting", all verbs from Class 30 are labelled "Verbs of Perception", while Class 39 contains "Verbs of Ingesting".

Among others, three basic types of information are covered in VN: (1) verb subcategorization frames (SCFs), which describe the syntactic realization of the predicate-argument structure (e.g. The window broke); (2) selectional preferences (SPs), which capture the semantic preferences verbs have for their arguments (e.g. a breakable physical object broke); and (3) lexical-semantic verb classes (VCs), which provide a shared level of abstraction for verbs similar in their (morpho-)syntactic and semantic properties (e.g. BREAK verbs, sharing the VN class 45.1 and the top-level VN class 45). This basic overview of the VerbNet structure already suggests that measuring verb similarity is far from trivial, as it revolves around a complex interplay between various semantic and syntactic properties.

The wide coverage of VN in SimVerb-3500 ensures wide coverage of distinct verb groups/classes and their related linguistic phenomena. Finally, VerbNet enables further connections of SimVerb-3500 to other important lexical resources such as FrameNet BIBREF22, WordNet BIBREF15, and PropBank BIBREF16 through the sets of mappings created by the SemLink project initiative BIBREF23.

We next sketch the complete sampling procedure which resulted in the final set of 3500 distinct verb pairs finally annotated in a crowdsourcing study ( s : annotation ) .

( Step 1 ) We extracted all possible verb pairs from USF based on the associated POS tags available as part of USF annotations . To ensure that semantic association between verbs in a pair is not accidental , we then discarded all such USF pairs that had been associated by 2 or less participants in USF .

( Step 2 ) We then manually cleaned and simplified the list of pairs by removing all pairs with multi - word verbs ( e.g. , quit / give up ) , all pairs that contained the non - infinitive form of a verb ( e.g. , accomplished / finished , hidden / find ) , removing all pairs containing at least one auxiliary verb ( e.g. , must / to see , must / to be ) . The first two steps resulted in 3,072 USF - based verb pairs .

( Step 3 ) After this stage , we noticed that several top - level VN classes are not part of the extracted set . For instance , 5 VN classes did not have any member verbs included , 22 VN classes had only 1 verb , and 6 VN classes had 2 verbs included in the current set .

We resolved the VerbNet coverage issue by sampling from such ' under - represented ' VN classes directly . Note that this step is not related to USF at all . For each such class we sampled additional verb types until the class was represented by 3 or 4 member verbs ( chosen randomly ) . Following that , we sampled at least 2 verb pairs for each previously ' under - represented ' VN class by pairing 2 member verbs from each such class . This procedure resulted in 81 additional pairs , now 3,153 in total .

( Step 4 ) Finally , to complement this set with a sample of entirely unassociated pairs , we followed the SimLex-999 setup . We paired up the verbs from the 3,153 associated pairs at random . From these random pairings , we excluded those that coincidentally occurred elsewhere in USF ( and therefore had a degree of association ) . We sampled the remaining 347 pairs from this resulting set of unassociated pairs .
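A minimal Python sketch of this random - pairing step (all function and variable names here are hypothetical; the actual filtering used the full USF norms rather than a plain list of pairs):

```python
import random

def sample_unassociated_pairs(associated_pairs, usf_pairs, n_target=347, seed=0):
    """Pair up verbs from the associated pairs at random and keep only pairings
    that never occur in USF, i.e., that carry no known association."""
    rng = random.Random(seed)
    verbs = [v for pair in associated_pairs for v in pair]
    usf_set = {frozenset(p) for p in usf_pairs}
    unassociated = set()
    while len(unassociated) < n_target:
        a, b = rng.sample(verbs, 2)
        pair = frozenset((a, b))
        if len(pair) == 2 and pair not in usf_set:
            unassociated.add(pair)
    return [tuple(sorted(p)) for p in unassociated]
```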

( Output ) The final SimVerb-3500 data set contains 3,500 verb pairs in total , covering all associated verb pairs from USF , and ( almost ) all top - level VerbNet classes . All pairs were manually checked post - hoc by the authors plus 2 additional native English speakers to verify that the final data set does not contain unknown or invalid verb types .

The 3,500 pairs consist of 827 distinct verbs . 29 top - level VN classes are represented by 3 member verbs , while the three most represented classes cover 79 , 85 , and 93 member verbs . 40 verbs are not members of any VN class .

We performed an initial frequency analysis of SimVerb-3500 relying on the BNC counts available online
BIBREF24
. After ranking all BNC verbs according to their frequency , we divided the list into quartiles : Q1 ( most frequent verbs in BNC ) to Q4 ( least frequent verbs in BNC ) . Out of the 827 SimVerb-3500 verb types , 677 are contained in Q1 , 122 in Q2 , 18 in Q3 , 4 in Q4 ( to enroll , to hitchhike , to implode , to whelp ) , while 6 verbs are not covered in the BNC list . 2,818 verb pairs contain Q1 verbs , while there are 43 verb pairs with both verbs not in Q1 . Further empirical analyses are provided in the evaluation section .

Word Pair Scoring

We employ the Prolific Academic ( PA ) crowdsourcing platform , an online marketplace very similar to Amazon Mechanical Turk and to CrowdFlower .

Survey Structure

Following the SimLex-999 annotation guidelines , we had each of the 3500 verb pairs rated by at least 10 annotators . To distribute the workload , we divided the 3500 pairs into 70 tranches , with 79 pairs each . Out of the 79 pairs , 50 are unique to one tranche , while 20 manually chosen pairs are in all tranches to ensure consistency . The remaining 9 are duplicate pairs displayed to the same participant multiple times to detect inconsistent annotations .

Participants see 7 - 8 pairs per page . Pairs are rated on a scale of 0 - 6 by moving a slider . The first page shows 7 pairs , 5 unique ones and 2 from the consistency set . The following pages are structured the same but display one extra pair from the previous page . Participants are explicitly asked to give these duplicate pairs the same rating . We use them as quality control so that we can identify and exclude participants giving several inconsistent answers .

The survey contains three control questions in which participants are asked to select the most similar pair out of three choices . For instance , the first checkpoint is : Which of these pairs of words is the * most * similar ? 1 . to run / to jog 2 . to run / to walk 3 . to jog / to sweat . One checkpoint occurs right after the instructions and the other two later in the survey . The purpose is to check that annotators have understood the guidelines and to have another quality control measure for ensuring that they are paying attention throughout the survey . If just one of the checkpoint questions is answered incorrectly , the survey ends immediately and all scores from the annotator in question are discarded .

843 raters participated in the study , producing over 65,000 ratings . Unlike other crowdsourcing platforms , PA collects and stores detailed demographic information from the participants upfront . This information was used to carefully select the pool of eligible participants . We restricted the pool to native English speakers with a 90 % approval rate ( the maximum rate on PA ) , aged 18 - 50 , born and currently residing in the US ( 45 % of the 843 raters ) , UK ( 53 % ) , or Ireland ( 2 % ) . 54 % of the raters were female and 46 % male , with an average age of 30 . Participants took 8 minutes on average to complete the survey containing 79 questions .

Post-Processing

We excluded ratings of annotators who ( a ) answered one of the checkpoint questions incorrectly ( 75 % of exclusions ) ; ( b ) did not give equal ratings to duplicate pairs ; ( c ) showed suspicious rating patterns ( e.g. , randomly alternating between two ratings or using one single rating throughout ) . The final acceptance rate was 84 % . We then calculated the average of all ratings from the accepted raters ( at least 10 per pair ) for each pair . The score was finally scaled linearly from the 0 - 6 to the 0 - 10 interval as in
BIBREF12
.
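A small sketch of this aggregation step (hypothetical data layout: ratings maps each verb pair to its per - rater slider scores on the 0 - 6 scale):

```python
def aggregate_scores(ratings, accepted_raters):
    """Average the ratings of accepted raters per pair and rescale linearly
    from the 0-6 slider interval to the final 0-10 interval."""
    scores = {}
    for pair, by_rater in ratings.items():
        kept = [score for rater, score in by_rater.items() if rater in accepted_raters]
        scores[pair] = (sum(kept) / len(kept)) * 10.0 / 6.0
    return scores

# Example: one pair rated by three accepted raters.
demo = {("to run", "to jog"): {"r1": 5, "r2": 6, "r3": 5}}
print(aggregate_scores(demo, accepted_raters={"r1", "r2", "r3"}))
```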

Evaluating Subsets

The large coverage and scale of SimVerb-3500 enables model evaluation based on selected criteria . In this section , we showcase a few example analyses .

Conclusions

SimVerb-3500 is a verb similarity resource for analysis and evaluation that will be of use to researchers involved in understanding how humans or machines represent the meaning of verbs , and , by extension , scenes , events and full sentences . The size and coverage of syntactico - semantic phenomena in SimVerb-3500 makes it possible to compare the strengths and weaknesses of various representation models via statistically robust analyses on specific word classes .

To demonstrate the utility of SimVerb-3500 , we conducted a selection of analyses with existing representation - learning models . One clear conclusion is that distributional models trained on raw text ( e.g. SGNS ) perform very poorly on low frequency and highly polysemous verbs . This degradation in performance can be partially mitigated by focusing models on more principled distributional contexts , such as those defined by symmetric patterns . More generally , the finding suggests that , in order to model the diverse spectrum of verb semantics , we may require algorithms that are better suited to fast learning from few examples
BIBREF32
, and have some flexibility with respect to sense - level distinctions
BIBREF33
,
BIBREF34
. In future work we aim to apply such methods to the task of verb acquisition .

Beyond the preliminary conclusions from these initial analyses , the benefit of SimVerb-3500 will become clear as researchers use it to probe the relationship between architectures , algorithms and representation quality for a wide range of verb classes . Better understanding of how to represent the full diversity of verbs should in turn yield improved methods for encoding and interpreting the facts , propositions , relations and events that constitute much of the important information in language .

Acknowledgments

This work is supported by the ERC Consolidator Grant LEXICAL ( 648909 ) .

SimVerb-3500 : A Large - Scale Evaluation Set of Verb Similarity

Supplementary Material


Unsupervised Text-Based Models

These models mainly learn from co - occurrence statistics in large corpora , therefore to facilitate the generality of our results , we evaluate them on two different corpora . With 8B we refer to the corpus produced by the word2vec script , consisting of 8 billion tokens from various sources
BIBREF4
.
With PW we refer to the English Polyglot Wikipedia corpus
BIBREF35
. d denotes the embedding dimensionality , and ws is the window size in the case of bag - of - words contexts . The models we consider are as follows :

Skip - gram with negative sampling ( SGNS )
BIBREF4
,
BIBREF45
trained with bag - of - words ( BOW ) contexts ; d=500 , ws=2 on 8B as in prior work
BIBREF44
,
BIBREF48
. d=300 , ws=2 on PW as in prior work
BIBREF42
,
BIBREF50
.

SGNS trained with universal dependency ( UD ) contexts following the setup of
BIBREF42
,
BIBREF50
. The PW data were POS - tagged with universal POS ( UPOS ) tags
BIBREF47
using TurboTagger
BIBREF43
, trained using default settings without any further parameter fine - tuning ( SVM MIRA with 20 iterations ) on the train+dev portion of the UD treebank annotated with UPOS tags . The data were then parsed using the graph - based Mate parser v3.61
BIBREF37
. d=300 as in
BIBREF50


Another variant of a dependency - based SGNS model is taken from the recent work of Schwartz:2016naacl , based on Levy:2014acl . The 8B corpus is parsed with labeled Stanford dependencies
BIBREF39
; the Stanford POS Tagger
BIBREF49
and the stack version of the MALT parser
BIBREF41
are used ; d=500 as in prior work
BIBREF48
.

All other parameters of all SGNS models are set to the standard settings : the models are trained with stochastic gradient descent , global learning rate of 0.025 , subsampling rate 1e-4 , 15 epochs .
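For illustration, these settings roughly correspond to the following gensim configuration; this is only a sketch, since the original vectors were not necessarily trained with gensim and the number of negative samples is not stated above and is assumed here:

```python
from gensim.models import Word2Vec

# Placeholder corpus; in the experiments this would be the tokenized 8B or PW corpus.
sentences = [["to", "run", "is", "to", "jog"], ["to", "reply", "is", "to", "respond"]]

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram
    negative=5,       # negative sampling (count assumed, not given in the text)
    vector_size=500,  # d=500 for the 8B setup (d=300 for PW)
    window=2,         # ws=2
    alpha=0.025,      # global learning rate
    sample=1e-4,      # subsampling rate
    epochs=15,
    min_count=1,      # kept low only so this toy example runs
)
vector = model.wv["run"]
```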

A template - based approach to vector space modeling introduced by Schwartz:2015conll . Vectors are trained based on co - occurrence of words in symmetric patterns
BIBREF38
, and an antonym detection mechanism is plugged in the representations . We use pre - trained dense vectors ( d=300 and d=500 ) with the antonym detector enabled , available online .

Traditional count - based vectors using PMI weighting and SVD dimensionality reduction ( ws = 2 ; d = 500 ) . This is the best performing reduced count - based model from Baroni2014acl ; the vectors were obtained online .

Models Relying on External Resources

Sparse binary vectors built from a wide variety of hand - crafted linguistic resources , e.g. , WordNet , Supersenses , FrameNet , Emotion and Sentiment lexicons , Connotation lexicon , among others
BIBREF29
.

Wieting:2015tacl use the Paraphrase Database ( PPDB )
BIBREF40
word pairs to learn word vectors which emphasise paraphrasability .
They do this by fine - tuning , also known as retro - fitting
BIBREF6
, word2vec vectors using an SGNS - inspired objective function designed to incorporate the PPDB semantic similarity constraints . Two variants are available online : d=25 and d=300 .

Mrksic:2016naacl suggest another variant of the retro - fitting procedure called counter - fitting ( CF ) which further improves the Paragram vectors by injecting antonymy constraints from PPDB v2.0
BIBREF46
into the final vector space .
d=300 .

event : verbs describe the relational information among participants in an event and exhibit a rich range of syntactic and semantic behaviour BIBREF0 , BIBREF1
rating scale : 0 ( not similar at all ) to 10 ( synonymous )
VerbNet ( VN ) : the largest online verb lexicon currently available for English
verb subcategorization frames ( SCFs ) : describe the syntactic realization of the predicate - argument structure
selectional preferences ( SPs ) : capture the semantic preferences verbs have for their arguments
lexical - semantic verb classes ( VCs ) : provide a shared level of abstraction for verbs similar in their ( morpho-)syntactic and semantic properties
SimVerb-3500 : a large - scale evaluation set of verb similarity
8B : the corpus produced by the word2vec script , consisting of 8 billion tokens from various sources
d : the embedding dimensionality
ws : the window size in the case of bag - of - words contexts
template - based approach : a vector space model trained on co - occurrences of words in symmetric patterns
Self-Attention with Relative Position Representations 1803.02155 2018 N18-2074
sequence learning
relative position representations
neural networks
sequence to sequence learning
neural network

Introduction

Recent approaches to sequence to sequence learning typically leverage recurrence
BIBREF0
, convolution
BIBREF1
,
BIBREF2
, attention
BIBREF3
, or a combination of recurrence and attention
BIBREF4
,
BIBREF5
,
BIBREF6
,
BIBREF7
as basic building blocks . These approaches incorporate information about the sequential position of elements differently .

Recurrent neural networks ( RNNs ) typically compute a hidden state h t , as a function of their input at time t and a previous hidden state h t-1 , capturing relative and absolute positions along the time dimension directly through their sequential structure . Non - recurrent models do not necessarily consider input elements sequentially and may hence require explicitly encoding position information to be able to use sequence order .

One common approach is to use position encodings which are combined with input elements to expose position information to the model . These position encodings can be a deterministic function of position
BIBREF8
,
BIBREF3
or learned representations . Convolutional neural networks inherently capture relative positions within the kernel size of each convolution . They have been shown to still benefit from position encodings
BIBREF1
, however .

For the Transformer , which employs neither convolution nor recurrence , incorporating explicit representations of position information is an especially important consideration since the model is otherwise entirely invariant to sequence ordering . Attention - based models have therefore used position encodings or biased attention weights based on distance
BIBREF9
.

In this work we present an efficient way of incorporating relative position representations in the self - attention mechanism of the Transformer . Even when entirely replacing its absolute position encodings , we demonstrate significant improvements in translation quality on two machine translation tasks .

Our approach can be cast as a special case of extending the self - attention mechanism of the Transformer to considering arbitrary relations between any two elements of the input , a direction we plan to explore in future work on modeling labeled , directed graphs .

Transformer

The Transformer
BIBREF3
employs an encoder - decoder structure , consisting of stacked encoder and decoder layers .
Encoder layers consist of two sublayers : self - attention followed by a position - wise feed - forward layer . Decoder layers consist of three sublayers : self - attention followed by encoder - decoder attention , followed by a position - wise feed - forward layer . It uses residual connections around each of the sublayers , followed by layer normalization
BIBREF10
. The decoder uses masking in its self - attention to prevent a given output position from incorporating information about future output positions during training .

Position encodings based on sinusoids of varying frequency are added to encoder and decoder input elements prior to the first layer . In contrast to learned , absolute position representations , the authors hypothesized that sinusoidal position encodings would help the model to generalize to sequence lengths unseen during training by allowing it to learn to attend also by relative position . This property is shared by our relative position representations which , in contrast to absolute position representations , are invariant to the total sequence length .

Residual connections help propagate position information to higher layers .

Self-Attention

Self - attention sublayers employ h attention heads . To form the sublayer output , results from each head are concatenated and a parameterized linear transformation is applied .

Each attention head operates on an input sequence x = (x_1 , ... , x_n) of n elements , where x_i ∈ ℝ^{d_x} , and computes a new sequence z = (z_1 , ... , z_n) of the same length , where z_i ∈ ℝ^{d_z} .

Each output element z_i is computed as a weighted sum of linearly transformed input elements :

z_i = Σ_{j=1}^{n} α_ij (x_j W^V)    (1)

Each weight coefficient α_ij is computed using a softmax function :

α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)

The logit e_ij is computed using a compatibility function that compares two input elements :

e_ij = (x_i W^Q)(x_j W^K)^T / √(d_z)    (2)

Scaled dot product was chosen for the compatibility function , which enables efficient computation . Linear transformations of the inputs add sufficient expressive power .

W^Q , W^K , W^V ∈ ℝ^{d_x × d_z} are parameter matrices . These parameter matrices are unique per layer and attention head .
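A minimal numpy sketch of a single attention head implementing eqs. (1) and (2) (illustrative only; the actual model batches all heads and sequences together):

```python
import numpy as np

def attention_head(x, W_q, W_k, W_v):
    """One self-attention head; x: (n, d_x), W_q/W_k/W_v: (d_x, d_z)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # (n, d_z) each
    e = q @ k.T / np.sqrt(k.shape[-1])               # eq. (2): scaled dot products
    a = np.exp(e - e.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)            # softmax over positions
    return a @ v                                     # eq. (1): weighted sum, (n, d_z)
```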

Relation-aware Self-Attention

We propose an extension to self - attention to consider the pairwise relationships between input elements . In this sense , we model the input as a labeled , directed , fully - connected graph .

The edge between input elements x_i and x_j is represented by vectors a_ij^V , a_ij^K ∈ ℝ^{d_a} . The motivation for learning two distinct edge representations is that a_ij^V and a_ij^K are suitable for use in eq . ( 3 ) and eq . ( 4 ) , respectively , without requiring additional linear transformations . These representations can be shared across attention heads . We use d_a = d_z .

We modify eq . ( 1 ) to propagate edge information to the sublayer output :

z_i = Σ_{j=1}^{n} α_ij (x_j W^V + a_ij^V)    (3)

This extension is presumably important for tasks where information about the edge types selected by a given attention head is useful to downstream encoder or decoder layers . However , as explored in SECREF16 , this may not be necessary for machine translation .

We also , importantly , modify eq . ( 2 ) to consider edges when determining compatibility :

e_ij = (x_i W^Q)(x_j W^K + a_ij^K)^T / √(d_z)    (4)

The primary motivation for using simple addition to incorporate edge representations in eq . ( 3 ) and eq . ( 4 ) is to enable the efficient implementation described in SECREF10 .

Relative Position Representations

For linear sequences , edges can capture information about the relative position differences between input elements . The maximum relative position we consider is clipped to a maximum absolute value of k . We hypothesized that precise relative position information is not useful beyond a certain distance . Clipping the maximum distance also enables the model to generalize to sequence lengths not seen during training . Therefore , we consider 2k+1 unique edge labels :

a_ij^K = w^K_{clip(j-i, k)} ,  a_ij^V = w^V_{clip(j-i, k)} ,  where clip(x, k) = max(-k, min(k, x)) .

We then learn relative position representations w^K = (w^K_{-k} , ... , w^K_{k}) and w^V = (w^V_{-k} , ... , w^V_{k}) , where w^K_i , w^V_i ∈ ℝ^{d_a} .
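A small numpy sketch of building the clipped relative - position lookup tables (randomly initialized here; in the model w^K and w^V are learned parameters):

```python
import numpy as np

def relative_position_tables(n, k, d_a, seed=0):
    """Return a_K, a_V of shape (n, n, d_a), where a_K[i, j] = w^K_{clip(j-i, k)}."""
    rng = np.random.default_rng(seed)
    w_k = rng.normal(size=(2 * k + 1, d_a))          # w^K_{-k..k}
    w_v = rng.normal(size=(2 * k + 1, d_a))          # w^V_{-k..k}
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # j - i
    idx = np.clip(offsets, -k, k) + k                          # shift to 0..2k
    return w_k[idx], w_v[idx]
```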

Efficient Implementation

There are practical space complexity concerns when considering edges between input elements , as noted by Veličković et al . velivckovic2017 , who consider unlabeled graph inputs to an attention model .

For a sequence of length n and h attention heads , we reduce the space complexity of storing relative position representations from O(h n^2 d_a) to O(n^2 d_a) by sharing them across heads . Additionally , relative position representations can be shared across sequences . Therefore , the overall self - attention space complexity increases from O(b h n d_z) to O(b h n d_z + n^2 d_a) . Given d_a = d_z , the size of the relative increase depends on n/(bh) .

The Transformer computes self - attention efficiently for all sequences , heads , and positions in a batch using parallel matrix multiplication operations
BIBREF3
. Without relative position representations , each e ij can be computed using bh parallel multiplications of n×d z and d z ×n matrices . Each matrix multiplication computes e ij for all sequence positions , for a particular head and sequence . For any sequence and head , this requires sharing the same representation for each position across all compatibility function applications ( dot products ) with other positions .

When we consider relative positions , the representations differ across pairs of positions . This prevents us from computing all e_ij for all pairs of positions in a single matrix multiplication . We also want to avoid broadcasting relative position representations . However , both issues can be resolved by splitting the computation of eq . ( 4 ) into two terms :

e_ij = ( x_i W^Q (x_j W^K)^T + x_i W^Q (a_ij^K)^T ) / √(d_z)    (5)

The first term is identical to eq . ( 2 ) , and can be computed as described above . For the second term involving relative position representations , tensor reshaping can be used to compute n parallel multiplications of bh×d_z and d_z×n matrices . Each matrix multiplication computes contributions to e_ij for all heads and batches , corresponding to a particular sequence position . Further reshaping allows adding the two terms . The same approach can be used to efficiently compute eq . ( 3 ) .
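A sketch of the split computation of eq. (5), using einsum in place of the explicit reshaping described above (the result is equivalent to the bh and n parallel matrix multiplications):

```python
import numpy as np

def relative_logits(q, k, a_k):
    """q, k: (batch, heads, n, d_z); a_k: (n, n, d_z) from the lookup above.
    Returns compatibilities e of shape (batch, heads, n, n)."""
    d_z = q.shape[-1]
    term1 = np.einsum('bhnd,bhmd->bhnm', q, k)       # x_i W^Q (x_j W^K)^T
    term2 = np.einsum('bhnd,nmd->bhnm', q, a_k)      # x_i W^Q (a_ij^K)^T
    return (term1 + term2) / np.sqrt(d_z)
```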

For our machine translation experiments , the result was a modest 7 % decrease in steps per second , but we were able to maintain the same model and batch sizes on P100 GPUs as Vaswani et al . vaswani2017 .

Experimental Setup

We use the tensor2tensor library for training and evaluating our model .

We evaluated our model on the WMT 2014 machine translation task , using the WMT 2014 English - German dataset consisting of approximately 4.5 M sentence pairs and the 2014 WMT English - French dataset consisting of approximately 36 M sentence pairs .

For all experiments , we split tokens into a 32,768 word - piece vocabulary
BIBREF7
. We batched sentence pairs by approximate length , and limited input and output tokens per batch to 4096 per GPU . Each resulting training batch contained approximately 25,000 source and 25,000 target tokens .

We used the Adam optimizer
BIBREF11
with β 1 =0.9 , β 2 =0.98 , and ϵ=10 -9 . We used the same warmup and decay strategy for learning rate as Vaswani et al . vaswani2017 , with 4,000 warmup steps . During training , we employed label smoothing of value ϵ ls =0.1
BIBREF12
. For evaluation , we used beam search with a beam size of 4 and length penalty α=0.6
BIBREF7
.
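For reference, the warmup - and - decay schedule of Vaswani et al. can be written as a small function (a sketch; the experiments used the tensor2tensor implementation):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```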

For our base model , we used 6 encoder and decoder layers , d x =512 , d z =64 , 8 attention heads , 1024 feed forward inner - layer dimensions , and P dropout =0.1 . When using relative position encodings , we used clipping distance k=16 , and used unique edge representations per layer and head . We trained for 100,000 steps on 8 K40 GPUs , and did not use checkpoint averaging .

For our big model , we used 6 encoder and decoder layers , d x =1024 , d z =64 , 16 attention heads , 4096 feed forward inner - layer dimensions , and P dropout =0.3 for EN - DE and P dropout =0.1 for EN - FR . When using relative position encodings , we used k=8 , and used unique edge representations per layer . We trained for 300,000 steps on 8 P100 GPUs , and averaged the last 20 checkpoints , saved at 10 minute intervals .

Machine Translation

We compared our model using only relative position representations to the baseline Transformer
BIBREF3
with sinusoidal position encodings . We generated baseline results to isolate the impact of relative position representations from any other changes to the underlying library and experimental configuration .

For English - to - German our approach improved performance over our baseline by 0.3 and 1.3 BLEU for the base and big configurations , respectively . For English - to - French it improved by 0.5 and 0.3 BLEU for the base and big configurations , respectively . In our experiments we did not observe any benefit from including sinusoidal position encodings in addition to relative position representations . The results are shown in Table
TABREF12
.

Model Variations

We performed several experiments modifying various aspects of our model . All of our experiments in this section use the base model configuration without any absolute position representations . BLEU scores are calculated on the WMT English - to - German task using the development set , newstest2013 .

We evaluated the effect of varying the clipping distance , k , of the maximum absolute relative position difference . Notably , for k≥2 , there does not appear to be much variation in BLEU scores . However , as we use multiple encoder layers , precise relative position information may be able to propagate beyond the clipping distance . The results are shown in Table
TABREF17
.

We also evaluated the impact of ablating each of the two relative position representations defined in section SECREF5 , a_ij^V in eq . ( 3 ) and a_ij^K in eq . ( 4 ) . Including relative position representations solely when determining compatibility between elements may be sufficient , but further work is needed to determine whether this is true for other tasks . The results are shown in Table
TABREF18
.

Conclusions

In this paper we presented an extension to self - attention that can be used to incorporate relative position information for sequences , which improves performance for machine translation .

For future work , we plan to extend this mechanism to consider arbitrary directed , labeled graph inputs to the Transformer . We are also interested in nonlinear compatibility functions to combine input representations and edge representations . For both of these extensions , a key consideration will be determining efficient implementations .

recurrent neural networks ( RNNs ) : compute a hidden state h_t as a function of their input at time t and the previous hidden state h_{t-1} , capturing relative and absolute positions along the time dimension directly through their sequential structure
Transformer BIBREF3 : employs an encoder - decoder structure consisting of stacked encoder and decoder layers
z_i : computed as a weighted sum of linearly transformed input elements
linear transformations of the inputs : add sufficient expressive power
W^Q , W^K , W^V : parameter matrices , unique per layer and attention head
simple addition in eqs . ( 3 ) and ( 4 ) : enables the efficient implementation described in SECREF10
velivckovic2017 : considers unlabeled graph inputs to an attention model
WMT 2014 English - German dataset : approximately 4.5 M sentence pairs
WMT 2014 English - French dataset : approximately 36 M sentence pairs
Multi-Task Networks With Universe, Group, and Task Feature Learning 1907.01791 2019 P19-1079
information sharing
neural networks
multiple related tasks
multi-task
multi-task learning
learning

Introduction

In multi - task learning
BIBREF0
, multiple related tasks are learned together . Rather than learning one task at a time , multi - task learning uses information sharing between multiple tasks . This technique has been shown to be effective in multiple different areas , e.g. , vision
BIBREF1
, medicine
BIBREF2
, and natural language processing
BIBREF3
,
BIBREF4
,
BIBREF5
.

The selection of tasks to be trained together in multi - task learning can be seen as a form of supervision : The modeler picks tasks that are known a priori to share some commonalities and decides to train them together . In this paper , we consider the case when information about the relationships of these tasks is available as well , in the form of natural groups of these tasks . Such task groups can be available in various multi - task learning scenarios : In multi - language modeling , when learning to parse or translate multiple languages jointly , information on language families would be available ; in multimodal modeling , e.g. , when learning text tasks and image tasks jointly , clustering the tasks into these two groups would be natural . In multi - domain modeling , which is the focus of this paper , different tasks naturally group into different domains .

We hypothesize that adding such inter - task supervision can encourage a model to generalize along the desired task dimensions . We introduce neural network architectures that can encode task groups , in two variants : a parallel variant , in which task , task group , and task universe features are learned side by side , and a serial variant , in which these features are learned in consecutive stages .

These neural network architectures are general and can be applied to any multi - task learning problem in which the tasks can be grouped into different task groups .

Proposed Architectures

The goal of multi - task learning ( MTL ) is to utilize shared information across related tasks . The features learned for one task can be transferred to reinforce the feature learning of other tasks , thereby boosting the performance of all tasks via mutual feedback within a unified MTL architecture . We consider the problem of multi - domain natural language understanding ( NLU ) for virtual assistants . Recent progress has been made in building NLU models that identify and extract structured information from a user 's request by jointly learning intent classification ( IC ) and slot filling ( SF )
BIBREF7
However , in practice , a common issue when building NLU models for every skill is that the amount of annotated training data varies across skills and is small for many individual skills . Motivated by the idea of learning multiple tasks jointly , we address this paucity of data by transferring knowledge between different tasks so that they can reinforce one another .

In what follows , we describe four end - to - end MTL architectures ( Sections SECREF10 to SECREF20 ) . These architectures are encoder - decoder architectures where the encoder extracts three different sets of features : task , task group , and task universe features , and the decoder produces desired outputs based on feature representations . In particular , the first one ( Figure
FIGREF13
) is a parallel MTL architecture where task , task group , and task universe features are encoded in parallel and then concatenated to produce a composite representation . The next three architectures ( Figure
FIGREF15
) are serial architectures in different variants : In the first serial MTL architecture , group and universe features are learned first and are then used as inputs to learn task - specific features . The next serial architecture is similar but introduces highway connections that feed representations from earlier stages in the series directly into later stages . In the last architecture , the order of serially learned features is changed , so that task - specific features are encoded first .

In Section SECREF21 , we introduce an encoder - decoder architecture to perform slot filling and intent classification jointly in a multi - domain scenario for virtual assistants . Although we conduct experiments on multi - domain NLU systems of virtual assistants , the architectures can easily be applied to other tasks . Specifically , the encoder / decoder could be instantiated with any components or architectures , e.g. , a Bi - LSTM
BIBREF8
for the encoder , and classification or sequential labeling for the decoder .


Parallel MTL Architecture

The first architecture , shown in Figure
FIGREF13
, is designed to learn the three sets of features at the same stage ; therefore we call it a parallel MTL architecture , or Parallel[Univ+Group+Task ] . This architecture uses three types of encoders : 1 ) A universe encoder to extract the common features across all tasks ; 2 ) task - specific encoders to extract task - specific features ; and 3 ) group - specific encoders to extract features within the same group . Finally , these three feature representations are concatenated and passed through the task - specific decoders to produce the output .

Assume we are given an MTL problem with m groups of tasks . Each task is associated with a dataset of training examples D = {(x^i_1 , y^i_1) , ... , (x^i_{m_i} , y^i_{m_i})}_{i=1}^{m} , where x^i_k and y^i_k denote the input data and corresponding labels for task k in group i . The parameters of the parallel MTL model ( and also of the other MTL models ) are trained to minimize the weighted sum of individual task - specific losses , computed as :

ℒ = Σ_{i=1}^{m} Σ_{j=1}^{m_i} α^i_j ℒ^i_j    (1)

where α^i_j is a static weight for task j in group i , which could be proportional to the size of the training data of the task . The loss function ℒ^i_j is defined based on the tasks performed by the decoder , which will be described in Section SECREF21 .

To eliminate redundancy in the features learned by the three different types of encoders , we add an adversarial loss and orthogonality constraints
BIBREF9
,
BIBREF10
. Adding an adversarial loss aims to prevent task - specific features from creeping into the shared space . We apply adversarial training to our shared encoders , i.e. , the universe and group encoders . To encourage the task , group , and universe encoders to learn features from different aspects of the inputs , we add orthogonality constraints between the task and universe / group representations of each domain . The loss function defined in Eq . ( 1 ) becomes :

ℒ_total = Σ_{i=1}^{m} Σ_{j=1}^{m_i} α^i_j ℒ^i_j + λ ℒ_adv + γ ℒ_ortho    (2)

where ℒ_adv and ℒ_ortho denote the loss functions for adversarial training and orthogonality constraints , respectively , and λ and γ are hyperparameters .
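One common form of the orthogonality constraint is the squared Frobenius norm of the product of the task - specific and shared feature matrices; this form is assumed in the sketch below and may differ in detail from the cited formulation:

```python
import numpy as np

def orthogonality_penalty(task_feats, shared_feats):
    """task_feats, shared_feats: (batch, d) matrices of task-specific and shared
    (universe or group) features; returns ||task_feats^T shared_feats||_F^2."""
    return float(np.sum((task_feats.T @ shared_feats) ** 2))
```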

Serial MTL Architecture

The second MTL architecture , called Serial , has the same set of encoders and decoders as the parallel MTL architecture . The differences are 1 ) the order of learning features and 2 ) the input for individual decoders . In this serial MTL architecture , three sets of features are learned in a sequential way in two stages . As shown in Figure
FIGREF15
, group encoders and a universe encoder encode group - level and fully shared universe - level features , respectively , based on input data . Then , task encoders use that concatenated feature representation to learn task - specific features . Finally , in this serial architecture , the individual task decoders use their corresponding private encoder outputs only to perform tasks . This contrasts with the parallel MTL architecture , which uses combinations of three feature representations as input to their respective task decoders .

Serial MTL Architecture with Highway Connections

Decoders in the Serial architecture , introduced in the previous section , do not have direct access to group and universe feature representations . However , directly utilizing these shared features could be beneficial for some tasks . Therefore , we add highway connections to incorporate universe encoder output and corresponding group encoder outputs as inputs to the individual decoders in addition to task - specific encoder output ; we call this model Serial+Highway . As shown in Figure
FIGREF15
, the input to the task - specific encoders is the same as in the serial MTL architecture , i.e. , the concatenation of the group and universe features . The input to each task - specific decoder , however , is now the concatenation of the features from the group encoder , the universe encoder , and the task - specific encoder .

Serial MTL Architecture with Highway Connections and Feature Swapping

In both serial MTL architectures introduced in the previous two sections , the input to the task encoders is the output of the more general group and universe encoders . That output potentially underrepresents some task - specific aspects of the input . Therefore , we introduce Serial+Highway+Swap , a variant of Serial+Highway in which the two stages of universe / group features and task - specific features are swapped . As shown in Figure
FIGREF15
, the task - specific representations are now learned in the first stage , and group and universe feature representations based on the task features are learned in the second stage . In this model , the task encoder directly takes input data and learns task - specific features . Then , the universe encoder and group encoders take the task - specific representations as input and generate fully shared universe and group - level representations , respectively . Finally , task - specific decoders use the concatenation of all three features ( universe , group , and task features ) to perform the final tasks .

An Example of Encoder-Decoder Architecture for a Single Task

All four MTL architectures introduced in the previous sections are general such that they could be applied to many applications . In this section , we use the task of joint slot filling ( SF ) and intent classification ( IC ) for natural language understanding ( NLU ) systems for virtual assistants as an example . We design an encoder - decoder architecture to perform SF and IC as a joint task , on top of which the four MTL architectures are built .

Given an input sequence x=(x 1 ,...,x T ) , the goal is to jointly learn an equal - length tag sequence of slots y S =(y 1 ,...,y T ) and the overall intent label y I . By using a joint model , rather than two separate models , for SF and IC , we exploit the correlation of the two output spaces . For example , if the intent of a sentence is book_ride it is likely to contain the slot types from_address and destination_address , and vice versa . The Joint - SF - IC model architecture is shown in Figure
FIGREF22
. It is a simplified version compared to the SlotGated model
BIBREF11
, which showed state - of - the - art results in jointly modeling SF and IC . Our architecture uses neither slot / intent attention nor a slot gate .

To address the issues of small amounts of training data and out - of - vocabulary ( OOV ) words , we use character embeddings , learned during training , as well as pre - trained word embeddings
BIBREF12
.
These word and character representations are passed as input to the encoder , which is a bidirectional long short - term memory ( Bi - LSTM )
BIBREF8
layer that computes forward hidden state h t → and backward hidden state h t ← per time step t in the input sequence .
We then concatenate h t → and h t ← to get final hidden state h t =[h t →;h t ←] at time step t .

Slot Filling ( SF ) : For a given sentence x=(x 1 ,...,x T ) with T words , we use their respective hidden states h=(h 1 ,...,h T ) from the encoder ( Bi - LSTM layer ) to model tagging decisions y S =(y 1 ,...,y T ) jointly using a conditional random field ( CRF ) layer
BIBREF12
,
BIBREF13
:

y^S = argmax_{y ∈ 𝒴_S} f_S(h , y)    (3)


where 𝒴 𝒮 is the set of all possible slot sequences , and f S is the CRF decoding function .

Intent Classification ( IC ) : Based on the hidden states from the encoder ( Bi - LSTM layer ) , we use the last forward hidden state h_T^→ and the last backward hidden state h_1^← to compute h^I = [h_T^→ ; h_1^←] , which can be regarded as the representation of the entire input sentence . Lastly , the intent y^I of the input sentence is predicted by feeding h^I into a fully connected layer with softmax activation to generate the prediction for each intent :

y^I = softmax( W^I_{hy} h^I + b )    (4)

where y^I is the predicted intent label , W^I_{hy} is a weight matrix , and b is a bias term .
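A small numpy sketch of this intent head (eq. (4)), given the per - time - step Bi - LSTM states:

```python
import numpy as np

def predict_intent(h_fwd, h_bwd, W, b):
    """h_fwd, h_bwd: (T, d) forward/backward hidden states.
    Builds h_I = [last forward; first backward] and applies a softmax layer."""
    h_i = np.concatenate([h_fwd[-1], h_bwd[0]])      # sentence representation
    logits = W @ h_i + b
    z = np.exp(logits - logits.max())
    return z / z.sum()                               # distribution over intents
```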

Joint Optimization : As our decoder models a joint task of SF and IC , we define the loss as a weighted sum of the individual losses , which can be plugged into ℒ^i_j in Eq . ( 1 ) :

ℒ^i_j = w_SF ℒ_SF + w_IC ℒ_IC    (5)

where ℒ SF is the cross - entropy loss based on the probability of the correct tag sequence
BIBREF12
, ℒ IC is the cross - entropy loss based on the predicted and true intent distributions
BIBREF10
and w SF , w IC are hyperparameters to adjust the weights of the two loss components .


Dataset

We evaluate our proposed models for multi - domain joint slot filling and intent classification for spoken language understanding systems . We use the following benchmark dataset and large - scale Alexa dataset for evaluation , and we use classic intent accuracy and slot F1 as in goo2018slot as evaluation metrics .

Benchmark Dataset : We consider two widely used datasets ATIS
BIBREF7
and Snips
BIBREF11
. The statistics of these datasets are shown in Table
TABREF27
. For each dataset , we use the same train / dev / test set as goo2018slot . ATIS is a single - domain ( Airline Travel ) dataset while Snips is a more complex multi - domain dataset due to the intent diversity and large vocabulary . For initial experiments , we use ATIS and Snips as two tasks . For multi - domain experiments , we split Snips into three domains – Music , Location , and Creative based on its intents and treat each one as an individual task . Thus for this second set of experiments , we have four tasks ( ATIS and Snips splits ) . Table
TABREF28
shows the new datasets obtained by splitting Snips . This new dataset allows us to introduce task groups . We define ATIS and Snips - location as one task group , and Snips - music and Snips - creative as another .

Alexa Dataset : We use live utterances spoken to 90 Alexa skills with the highest traffic . These are categorized into 10 domains , based on assignments by the developers of the individual skills . Each skill is a task in the MTL setting , and each domain acts as a task group . Due to the limited annotated datasets for skills , we do not have validation sets for these 90 skills . Instead , we use another 80 popular skills that fall into the same domain groups as the 90 skills as the validation set to tune model parameters . Table
TABREF29
shows the statistics of the Alexa dataset based on domains . For training and validation sets , we keep approximately the same number of skills per group to make sure that hyperparameters of adversarial training are unbiased . We use the validation datasets to choose the hyperparameters for the baselines as well as our proposed models .

Baselines

We compare our proposed model with the following three competitive architectures for single - task joint slot filling ( SF ) and intent classification ( IC ) , which have been widely used in prior literature :

JointSequence : hakkani2016multi proposed a Bi - LSTM joint model for slot filling , intent classification , and domain classification .

AttentionBased : liu2016attention showed that incorporating an attention mechanism into a Bi - LSTM joint model can reduce errors on intent detection and slot filling .

SlotGated : goo2018slot added a slot - gated mechanism into the traditional attention - based joint architecture , aiming to explicitly model the relationship between intent and slots , rather than implicitly modeling it with a joint loss .

We also compare our proposed model with two closely related multi - task learning ( MTL ) architectures that can be treated as simplified versions of our parallel MTL architecture :

Parallel[Univ ] : This model , proposed by liu2017adversarial , uses a universe encoder that is shared across all tasks , and decoders are task - specific .

Parallel[Univ+Task ] : This model , also proposed by liu2017adversarial , uses task - specific encoders in addition to the shared encoder . To ensure non - redundancy in features learned across shared and task - specific encoders , adversarial training and orthogonality constraints are incorporated .

Training Setup

All our proposed models are trained with backpropagation , and gradient - based optimization is performed using Adam
BIBREF14
. In all experiments , we set the character LSTM hidden size to 64 and word embedding LSTM hidden size to 128 . We use 300-dimension GloVe vectors
BIBREF15
for the benchmark datasets and in - house embeddings for the Alexa dataset , which are trained with Wikipedia data and live utterances spoken to Alexa . Character embedding dimensions and dropout rate are set to 100 and 0.5 respectively . Minimax optimization in adversarial training was implemented via the use of a gradient reversal layer
BIBREF16
,
BIBREF10
. The models are implemented with the TensorFlow library
BIBREF17
.

For benchmark data , the models are trained using an early - stop strategy with the maximum number of epochs set to 50 and patience ( i.e. , the number of epochs with no improvement on the dev set for both SF and IC ) set to 6 . In addition , the benchmark dataset has vocabularies of varied size across its datasets . To give equal importance to each of them , α^i_j ( see Eq . ( 1 ) ) is proportional to 1/n , where n is the training set size of task j in group i . We are able to train on CPUs , due to the low values of n .

For Alexa data , optimal hyperparameters are determined on the 80 development skills and applied to the training and evaluation of the 90 test skills . α^i_j is set to 1 here , as all skills have 10,000 training utterances sampled from the respective developer - defined skill grammars
BIBREF6
. Here , training was done using GPU - enabled EC2 instances ( p2.8xlarge ) . Our detailed training algorithm is similar to the one used by collobert2008unified and liu2016recurrent , liu2017adversarial , where training is achieved in a stochastic manner by looping over the tasks . For example , an epoch involves these four steps : 1 ) select a random skill ; 2 ) select a random batch from the list of available batches for this skill ; 3 ) update the model parameters by taking a gradient step w.r.t this batch ; 4 ) update the list of available batches for this skill by removing the current batch .
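A sketch of this stochastic task loop (update_fn stands in for the gradient step; all names are hypothetical):

```python
import random

def run_epoch(batches_by_skill, update_fn, rng=random.Random(0)):
    """Loop until every batch of every skill has been used once: pick a random
    skill, pick a random remaining batch of that skill, take a gradient step,
    then remove the batch from the skill's list."""
    remaining = {s: list(bs) for s, bs in batches_by_skill.items() if bs}
    while remaining:
        skill = rng.choice(list(remaining))
        batch = remaining[skill].pop(rng.randrange(len(remaining[skill])))
        update_fn(skill, batch)
        if not remaining[skill]:
            del remaining[skill]
```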

Benchmark data

Table
TABREF37
shows the results on ATIS and the original version of the Snips dataset ( as shown in Table
TABREF27
) . In the first four lines , ATIS and Snips are trained separately . In the last two lines ( Parallel ) , they are treated as two tasks in the MTL setup . There are no task groups in this particular experiment , as each utterance belongs to either ATIS or Snips , and all utterances belong to the task universe . The Joint - SF - IC architecture with CRF layer performs better than all three baseline models in terms of all evaluation metrics on both datasets , even after removing the slot - gate
BIBREF11
and attention
BIBREF18
. Learning universe features across both datasets in addition to the task features helps ATIS , while performance on Snips degrades . This might be due to the fact that Snips is a multi - domain dataset , which in turn motivates us to split the Snips dataset ( as shown in Table
TABREF28
) , so that the tasks in each domain ( i.e. , task group ) may share features separately .

Table
TABREF38
shows results on ATIS and our split version of Snips . We now have four tasks : ATIS , Snips - location , Snips - music , and Snips - creative . Joint - SF - IC is our baseline that treats these four tasks independently . All other models process the four tasks together in the MTL setup . For the models introduced in this paper , we define two task groups : ATIS and Snips - location as one group , and Snips - music and Snips - creative as another . Our models , which use these groups , generally outperform the other MTL models ( Parallel[Univ ] and Parallel[Univ+Task ] ) ; especially the serial MTL architectures perform well .

Alexa data

Table
TABREF39
shows the results of the single - domain model and the MTL models on the Alexa dataset . The trend is more clearly visible in these results than in the results on the benchmark data . As the Alexa data has more domains , there might not be many features that are common across all the domains . Capturing features that are only shared within a group becomes possible by incorporating task group encoders . Serial+Highway+Swap yields the best mean intent accuracy . Parallel[Univ+Group+Task ] and Serial+Highway show statistically indistinguishable results . For slot filling , all MTL architectures achieve competitive results on mean Slot F1 .

Overall , on both benchmark data and Alexa data , our architectures with group encoders show better results than others . Specifically , the serial architecture with highway connections achieves the best mean Slot F1 of 94.8 and 97.2 on Snips - music and Snips - location respectively and median Slot F1 of 81.99 on the Alexa dataset . Swapping its feature hierarchy enhances its intent accuracy to 97.5 on ATIS . It also achieves the best / competitive mean and median values on both SF and IC on the Alexa dataset . This supports our argument that when we try to learn common features across all the domains
BIBREF10
, we might miss crucial features that are only present across a group . Capturing those task group features boosts the performance of our unified model on SF and IC . In addition , when we attempt to learn three sets of features – task , task universe , and task group features – the serial architecture for feature learning helps . Specifically , when we have datasets from many domains , learning task features in the first stage and common features , i.e. , task universe and task group features , in the second stage yields the best results . This difference is more clearly visible in the results of the large - scale Alexa data than that of the small - scale benchmark dataset .

Result Analysis

To further investigate the performance of different architectures , we present the intent accuracy and slot F1 values on different groups of Alexa utterances in Tables
TABREF40
and
TABREF41
. For intent classification , Serial+Highway+Swap achieves the best results on six domains , and Parallel[Univ ] achieves the best results on the movie and news domains . This finding helps explain why Parallel[Univ ] is statistically indistinguishable from Serial+Highway+Swap on the Alexa dataset , as shown in Table
TABREF39
. Parallel[Univ ] outperforms MTL with group encoders when there is more information shared across domains . Examples of similar training utterances in different domains are “ go back eight hour ” and “ rewind for eighty five hour ” in a News skill ; “ to rewind the Netflix ” in a Smart Home skill ; and “ rewind nine minutes ” in a Music skill . The diverse utterance context in different domains could be learned through the universe encoder , which helps to improve the intent accuracy for these skills .

For slot filling , each domain favors one of the four MTL architectures : Parallel[Univ ] , Parallel[Univ+Group+Task ] , Serial+Highway , and Serial+Highway+Swap . Such a finding is consistent with the statistically indistinguishable performance between the different MTL architectures shown in Table
TABREF39
. Tables
TABREF44
and
TABREF45
show a few utterances from different datasets in the Smart Home category that are correctly predicted after learning task group features . General words like sixty , eight , alarm can have different slot types across different datasets . Learning features of the Smart Home category helps overcome such conflicts . However , a word in different tasks in the same domain can still have different slot types . For example , the first two utterances in Table
TABREF46
, which are picked from the Smart Home domain , have different slot types Name and Channel for the word one . In such cases , there is no guarantee that learning group features can overcome the conflicts . This might be due to the fact that the groups are predefined and they do not always represent the real task structure . To tackle this issue , learning task structures with features jointly
BIBREF19
rather than relying on predefined task groups , would be a future direction . In our experimental settings , all the universe , task , and task group encoders are instantiated with Bi - LSTM . An interesting area for future experimentation is to streamline the encoders , e.g. , adding additional bits to the inputs to the task encoder to indicate the task and group information , which is similar to the idea of using a special token as a representation of the language in a multilingual machine translation system
BIBREF20
.


Related Work

Multi - task learning ( MTL ) aims to learn multiple related tasks from data simultaneously to improve the predictive performance compared with learning independent models . Various MTL models have been developed based on the assumption that all tasks are related
BIBREF21
,
BIBREF22
,
BIBREF23
. To tackle the problem that task structure is usually unclear , evgeniou2004regularized extended support vector machines for single - task learning in a multi - task scenario by penalizing models if they are too far from a mean model . xue2007multi introduced a Dirichlet process prior to automatically identify subgroups of related tasks . passos2012flexible developed a nonparametric Bayesian model to learn task subspaces and features jointly .

On the other hand , with the advent of deep learning , MTL with deep neural networks has been successfully applied to different applications
BIBREF24
,
BIBREF25
,
BIBREF26
,
BIBREF27
. Recent work on multi - task learning considers different sharing structures , e.g. , only sharing at lower layers
BIBREF28
and introduces private and shared subspaces
BIBREF29
,
BIBREF10
. liu2017adversarial incorporated adversarial loss and orthogonality constraints into the overall training object , which helps in learning task - specific and task - invariant features in a non - redundant way . However , they do not explore task structures , which can contain crucial features only present within groups of tasks . Our work encodes task structure information in deep neural architectures .

Conclusions

We proposed a series of end - to - end multi - task learning architectures , in which task , task group and task universe features are learned non - redundantly . We further explored learning these features in parallel and serial MTL architectures . Our MTL models obtain state - of - the - art performance on the ATIS and Snips datasets for intent classification and slot filling . Experimental results on a large - scale Alexa dataset show the effectiveness of adding task group encoders into both parallel and serial MTL networks .

Acknowledgments

We thank Lambert Mathias for providing insightful feedback and Sandesh Swamy for preparing the Alexa test dataset . We also thank our team members as well as the anonymous reviewers for their valuable comments .

encoder / decoder : can be instantiated with any components , e.g. , a Bi - LSTM for the encoder and classification or sequential labeling for the decoder
parallel MTL model : trained to minimize the weighted sum of individual task - specific losses
ℒ_SF : the cross - entropy loss based on the probability of the correct tag sequence BIBREF12
ℒ_adv , ℒ_ortho : the loss functions for adversarial training and orthogonality constraints , respectively
slot filling ( SF ) and intent classification ( IC ) : tasks learned jointly in natural language understanding models
Bi - LSTM : bidirectional long short - term memory
ℒ_IC : the cross - entropy loss based on the predicted and true intent distributions
intent classification ( IC ) : predicting the overall intent label of an input sentence
ATIS : a single - domain ( Airline Travel ) dataset
task groups : ATIS and Snips - location form one task group ; Snips - music and Snips - creative form another
special token : used as a representation of the language in a multilingual machine translation system BIBREF20 , analogous to adding task and group information bits to the task encoder input
xue2007multi : introduced a Dirichlet process prior to automatically identify subgroups of related tasks
What do character-level models learn about morphology? The case of dependency parsing 1808.09180 2018 D18-1278
universal dependency parsing
oracle morphology
dependency parsing
morphological annotation
model morphology

Introduction

Modeling language input at the character level
BIBREF0
,
BIBREF1
is effective for many NLP tasks , and often produces better results than modeling at the word level . For parsing , ballesteros - dyer - smith:2015 : EMNLP have shown that character - level input modeling is highly effective on morphologically - rich languages , and the three best systems on the 45 languages of the CoNLL 2017 shared task on universal dependency parsing all use character - level models
BIBREF2
, BIBREF3 ,
BIBREF4
, BIBREF5 , showing that they are effective across many typologies .

The effectiveness of character - level models in morphologically - rich languages has raised a question and indeed debate about explicit modeling of morphology in NLP .
BIBREF0
propose that “ prior information regarding morphology ... among others , should be incorporated ” into character - level models , while
BIBREF6
counter that it is “ unnecessary to consider these prior information ” when modeling characters . Whether we need to explicitly model morphology is a question whose answer has a real cost : as ballesteros - dyer - smith:2015 : EMNLP note , morphological annotation is expensive , and this expense could be reinvested elsewhere if the predictive aspects of morphology are learnable from strings .

Do character - level models learn morphology ? We view this as an empirical claim requiring empirical evidence . The claim has been tested implicitly by comparing character - level models to word lookup models
BIBREF7
,
BIBREF8
. In this paper , we test it explicitly , asking how character - level models compare with an oracle model with access to morphological annotations . This extends experiments showing that character - aware language models in Czech and Russian benefit substantially from oracle morphology BIBREF9 , but here we focus on dependency parsing ( § SECREF2 ) — a task that benefits substantially from morphological knowledge — and we experiment with twelve languages using a variety of techniques to probe our models .

Our summary finding is that character - level models lag the oracle in nearly all languages ( § SECREF3 ) . The difference is small , but suggests that there is value in modeling morphology . When we tease apart the results by part of speech and dependency type , we trace the difference back to the character - level model 's inability to disambiguate words even when encoded with arbitrary context ( § SECREF4 ) . Specifically , it struggles with case syncretism , in which noun case — and thus syntactic function — is ambiguous . We show that the oracle relies on morphological case , and that a character - level model provided only with morphological case rivals the oracle , even when case is provided by another predictive model ( § SECREF5 ) . Finally , we show that the crucial morphological features vary by language ( § SECREF6 ) .

Dependency parsing model

We use a neural graph - based dependency parser combining elements of two recent models
BIBREF10
, BIBREF11 . Let w = w_1 , ⋯ , w_{|w|} be an input sentence of length |w| and let w_0 denote an artificial Root token . We represent the i-th input token w_i by concatenating its word representation ( § SECREF10 ) 𝐞(w_i) and its part - of - speech ( POS ) representation 𝐩_i . Using a semicolon (;) to denote vector concatenation , we have :

𝐱_i = [ 𝐞(w_i) ; 𝐩_i ]    (1)

We call 𝐱 i the embedding of w i since it depends on context - independent word and POS representations . We obtain a context - sensitive encoding 𝐡 i with a bidirectional LSTM ( bi - LSTM ) , which concatenates the hidden states of a forward and backward LSTM at position i . Using 𝐡 i f and 𝐡 i b respectively to denote these hidden states , we have : DISPLAYFORM0

We use 𝐡 i as the final input representation of w i .
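As a concrete illustration of this input pipeline , the sketch below builds x_i = [ e(w_i) ; p_i ] and runs a bi - LSTM to obtain h_i . It is a minimal PyTorch rendering of the description above , not the authors ' Chainer implementation ; the embedding sizes are placeholder assumptions , and only the two - layer , 200 - unit bi - LSTM follows the setup reported later .

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Builds x_i = [word embedding; POS embedding] and encodes it with a bi-LSTM."""
    def __init__(self, n_words, n_pos, word_dim=100, pos_dim=25, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        # two-layer bi-LSTM, as in the experimental setup described later
        self.bilstm = nn.LSTM(word_dim + pos_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, pos_ids):
        # word_ids, pos_ids: (batch, seq_len) integer tensors; position 0 can hold Root
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.bilstm(x)   # (batch, seq_len, 2 * hidden); h[:, i] is the encoding of w_i
        return h

enc = SentenceEncoder(n_words=20000, n_pos=18)
h = enc(torch.randint(0, 20000, (1, 6)), torch.randint(0, 18, (1, 6)))
print(h.shape)   # torch.Size([1, 6, 400])
```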

Head prediction

For each word w i , we compute a distribution over all other word positions j∈{0,...,|w|}/i denoting the probability that w j is the headword of w i . DISPLAYFORM0

Here , a is a neural network that computes an association between w i and w j using model parameters 𝐔 a ,𝐖 a , and 𝐯 a . DISPLAYFORM0
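The exact parameterisation of a is in the omitted equation ; the sketch below assumes a common additive form built from the named parameters U_a , W_a , and v_a , followed by a softmax over candidate heads . It is an illustrative guess , not the paper 's verified formula .

```python
import torch
import torch.nn as nn

class HeadPredictor(nn.Module):
    """Scores every candidate head j for each dependent i and normalises with a softmax."""
    def __init__(self, enc_dim=400, hidden=100):
        super().__init__()
        self.U = nn.Linear(enc_dim, hidden, bias=False)   # U_a (applied to the dependent)
        self.W = nn.Linear(enc_dim, hidden, bias=False)   # W_a (applied to the candidate head)
        self.v = nn.Linear(hidden, 1, bias=False)         # v_a

    def forward(self, H):
        # H: (seq_len, enc_dim) encodings; position 0 is the artificial Root
        n = H.size(0)
        # assumed additive association a(w_i, w_j) = v_a . tanh(U_a h_i + W_a h_j) for all pairs
        scores = self.v(torch.tanh(self.U(H).unsqueeze(1) + self.W(H).unsqueeze(0))).squeeze(-1)
        scores = scores.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))  # no self-heads
        return torch.softmax(scores, dim=1)   # row i is the head distribution for w_i

probs = HeadPredictor()(torch.randn(7, 400))
print(probs.shape)   # torch.Size([7, 7])
```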

Label prediction

Given a head prediction for word w i , we predict its syntactic label ℓ k ∈L using a similar network . DISPLAYFORM0

where L is the set of output labels and f is a function that computes label score using model parameters 𝐔 ℓ ,𝐖 ℓ , and 𝐕 ℓ : DISPLAYFORM0

The model is trained to minimize the summed cross - entropy losses of both head and label prediction . At test time , we use the Chu - Liu - Edmonds
BIBREF12
, BIBREF13 algorithm to ensure well - formed , possibly non - projective trees .

Computing word representations

We consider several ways to compute the word representation 𝐞(w i ) in Eq . EQREF2 :

word : Every word type has its own learned vector representation .

char - lstm : Characters are composed using a bi - LSTM
BIBREF0
, and the final states of the forward and backward LSTMs are concatenated to yield the word representation .

char - cnn : Characters are composed using a convolutional neural network
BIBREF1
.

trigram - lstm : Character trigrams are composed using a bi - LSTM , an approach that we previously found to be effective across typologies BIBREF9 .

oracle : We treat the morphemes of a morphological annotation as a sequence and compose them using a bi - LSTM . We only use universal inflectional features defined in the UD annotation guidelines . For example , the morphological annotation of “ chases ” is chase , person=3rd , num=SG , tense=Pres .
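The char - lstm , trigram - lstm , and oracle representations above all share the same compositional pattern : embed the subunits , run a one - layer bi - LSTM , and concatenate its final forward and backward states . A minimal Python sketch of that shared pattern follows ; the subunit vocabulary size and embedding dimension are placeholder assumptions .

```python
import torch
import torch.nn as nn

class SubwordComposer(nn.Module):
    """Composes a sequence of subunits (characters, character trigrams, or
    morphemes) with a one-layer bi-LSTM; the final forward and backward states
    are concatenated to give the word representation e(w_i)."""
    def __init__(self, n_units, unit_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(n_units, unit_dim)
        self.bilstm = nn.LSTM(unit_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, unit_ids):
        # unit_ids: (1, n_subunits), e.g. the characters of one word, or the
        # morphemes [chase, person=3rd, num=SG, tense=Pres] mapped to ids
        _, (h_n, _) = self.bilstm(self.emb(unit_ids))
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 2 * hidden)

composer = SubwordComposer(n_units=500)
e_w = composer(torch.tensor([[3, 7, 2, 9, 4, 11]]))  # hypothetical character ids
print(e_w.shape)   # torch.Size([1, 400])
```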

For the remainder of the paper , we use the name of the model as shorthand for the dependency parser that uses that model as input ( Eq . EQREF2 ) .

We experiment on twelve languages with varying morphological typologies ( Table
TABREF19
) in the Universal Dependencies ( UD ) treebanks version 2.0
BIBREF14
. Note that while Arabic and Hebrew follow a root & pattern typology , their datasets are unvocalized , which might reduce the observed effects of this typology . Following common practice , we remove language - specific dependency relations and multiword token annotations . We use gold sentence segmentation , tokenization , universal POS ( UPOS ) , and morphological ( XFEATS ) annotations provided in UD .

Our Chainer
BIBREF15
implementation encodes words ( Eq . EQREF3 ) in two - layer bi - LSTMs with 200 hidden units , and uses 100 hidden units for head and label predictions ( output of Eqs . 4 and 6 ) . We set the batch size to 16 for char - cnn and 32 for the other models , following a grid search . We apply dropout to the embeddings ( Eq . EQREF2 ) and the input of the head prediction . We use the Adam optimizer with an initial learning rate of 0.001 and clip gradients to 5 , and train all models for 50 epochs with early stopping . For the word model , we limit our vocabulary to the 20 K most frequent words , replacing less frequent words with an unknown word token . The char - lstm , trigram - lstm , and oracle models use a one - layer bi - LSTM with 200 hidden units to compose subwords . For char - cnn , we use the small model setup of kim2015 .
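For reference , this training configuration can be collected into a single config dictionary ; only values stated in the text appear below , and the key names themselves are my own .

```python
# Training configuration restated as a plain dict (values from the text above;
# anything the text does not specify is deliberately left out).
CONFIG = {
    "encoder": {"type": "bilstm", "layers": 2, "hidden": 200},
    "head_label_hidden": 100,
    "batch_size": {"char-cnn": 16, "default": 32},
    "optimizer": {"name": "adam", "lr": 0.001, "grad_clip": 5},
    "epochs": 50,
    "early_stopping": True,
    "word_vocab_size": 20000,          # rarer words replaced by an unknown token
    "subword_composer": {"type": "bilstm", "layers": 1, "hidden": 200},
}
```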

Table
TABREF20
presents test results for every model on every language , establishing three results . First , they support previous findings that character - level models outperform word - based models — indeed , the char - lstm model outperforms the word model on LAS for all languages except Hindi and Urdu , for which the results are identical . Second , they establish strong baselines for the character - level models : the char - lstm generally obtains the best parsing accuracy , closely followed by char - cnn . Third , they demonstrate that character - level models rarely match the accuracy of an oracle model with access to explicit morphology . This reinforces a finding of BIBREF9 : character - level models are effective tools , but they do not learn everything about morphology , and they seem to be closer to oracle accuracy in agglutinative than in fusional languages .

In character - level models , orthographically similar words share many parameters , so we would expect these models to produce good representations of OOV words that are morphological variants of training words . Does this effect explain why they are better than word - level models ?

Table
TABREF23
shows how the character model improves over the word model for both non - OOV and OOV words . On the agglutinative languages Finnish and Turkish , where the OOV rates are 23 % and 24 % respectively , we see the highest LAS improvements , and we see especially large improvements in the accuracy of OOV words . However , the effects are more mixed in other languages , even those with relatively high OOV rates . In particular , languages with rich morphology like Czech , Russian , and ( unvocalized ) Arabic see more improvement than languages with moderately rich morphology and high OOV rates like Portuguese or Spanish . This pattern suggests that parameter sharing between pairs of observed training words can also improve parsing performance . For example , if “ dog ” and “ dogs ” are observed in the training data , they will share activations in their context and on their common prefix .

Let 's turn to our main question : what do character - level models learn about morphology ? To answer it , we compare the oracle model to char - lstm , our best character - level model .

In the oracle , morphological annotations disambiguate some words that the char - lstm must disambiguate from context . Consider these Russian sentences from baerman - brown - corbett-2005 :

Maša čitaet pisˊmo

Masha reads letter

` Masha reads a letter . '

Na stole ležit pisˊmo

on table lies letter

` There 's a letter on the table . '

Pisˊmo ( “ letter ” ) acts as the subject in ( UID28 ) , and as the object in ( UID28 ) . This knowledge is available to the oracle via morphological case : in ( UID28 ) , the case of pisˊmo is nominative and in ( UID28 ) it is accusative . Could this explain why the oracle outperforms the character model ?

To test this , we look at accuracy for word types that are empirically ambiguous — those that have more than one morphological analysis in the training data . Note that by this definition , some ambiguous words will be seen as unambiguous , since they were seen with only one analysis . To make the comparison as fair as possible , we consider only words that were observed in the training data . Figure
FIGREF29
compares the improvement of the oracle on ambiguous and seen unambiguous words , and as expected we find that handling of ambiguous words improves with the oracle in almost all languages . The only exception is Turkish , which has the least training data .
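Selecting the empirically ambiguous types is a single pass over the training data ; a small sketch , assuming the data is available as ( word , morphological analysis ) pairs .

```python
from collections import defaultdict

def ambiguous_types(training_corpus):
    """Return the word types that are *empirically* ambiguous, i.e. observed
    with more than one morphological analysis in the training data.
    `training_corpus` is assumed to be an iterable of (word, analysis) pairs."""
    analyses = defaultdict(set)
    for word, analysis in training_corpus:
        analyses[word].add(analysis)
    return {w for w, a in analyses.items() if len(a) > 1}

# Toy example: Russian "pis'mo" is seen as both nominative and accusative.
corpus = [("pis'mo", "Case=Nom"), ("pis'mo", "Case=Acc"), ("Maša", "Case=Nom")]
print(ambiguous_types(corpus))   # {"pis'mo"}
```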

Now we turn to a more fine - grained analysis conditioned on the annotated part - of - speech ( POS ) of the dependent . We focus on four languages where the oracle strongly outperforms the best character - level model on the development set : Finnish , Czech , German , and Russian . We consider five POS categories that are frequent in all languages and consistently annotated for morphology in our data : adjective ( ADJ ) , noun ( NOUN ) , pronoun ( PRON ) , proper noun ( PROPN ) , and verb ( VERB ) .

Table
TABREF26
shows that the three nominal categories — NOUN , PRON , and PROPN — benefit substantially from oracle morphology , especially for the three fusional languages : Czech , German , and Russian .

We analyze results by the dependency type of the dependent , focusing on types that interact with morphology : root , nominal subjects ( nsubj ) , objects ( obj ) , indirect objects ( iobj ) , nominal modifiers ( nmod ) , adjectival modifiers ( amod ) , obliques ( obl ) , and ( syntactic ) case markings ( case ) .

Figure
FIGREF33
shows the differences in the confusion matrices of the char - lstm and oracle for those words on which both models correctly predict the head . The differences on Finnish are small , which we expect from the similar overall LAS of both models . But for the fusional languages , a pattern emerges : the char - lstm consistently underperforms the oracle on nominal subject , object , and indirect object dependencies — labels closely associated with noun categories . From inspection , it appears to frequently mislabel objects as nominal subjects when the dependent noun is morphologically ambiguous . For example , in the sentence of Figure
FIGREF35
, Gelände ( “ terrain ” ) is an object , but the char - lstm incorrectly predicts that it is a nominal subject . In the training data , Gelände is ambiguous : it can be accusative , nominative , or dative .

In German , the char - lstm frequently confuses objects and indirect objects . By inspection , we found 21 mislabeled cases , of which 20 would likely be correct if the model had access to morphological case ( usually dative ) . In Czech and Russian , the results are more varied : indirect objects are frequently mislabeled as objects , obliques , nominal modifiers , and nominal subjects . We note that indirect objects are relatively rare in these data , which may partly explain their frequent mislabeling .

So far , we 've seen that for our three fusional languages — German , Czech , and Russian — the oracle strongly outperforms a character model on nouns with ambiguous morphological analyses , particularly on core dependencies : nominal subjects , objects and indirect objects . Since the nominative , accusative , and dative morphological cases are strongly ( though not perfectly ) correlated with these dependencies , it is easy to see why the morphologically - aware oracle is able to predict them so well . We hypothesized that these cases are more challenging for the character model because these languages feature a high degree of syncretism — functionally distinct words that have the same form — and in particular case syncretism . For example , referring back to examples ( UID28 ) and ( UID28 ) , the character model must disambiguate pisˊmo from its context , whereas the oracle can directly disambiguate it from a feature of the word itself .

To understand this , we first designed an experiment to see whether the char - lstm could successfully disambiguate noun case , using a method similar to
BIBREF8
. We train a neural classifier that takes as input a word representation from the trained parser and predicts a morphological feature of that word — for example that its case is nominative ( Case = Nom ) . The classifier is a feedforward neural network with one hidden layer , followed by a ReLU non - linearity . We consider two representations of each word : its embedding ( 𝐱 i ; Eq . EQREF2 ) and its encoding ( 𝐡 i ; Eq . EQREF3 ) . To understand the importance of case , we consider it alongside number and gender features as well as whole feature bundles .
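A minimal sketch of such a diagnostic classifier is shown below , assuming the parser representations ( embeddings or encodings ) are pre - extracted and frozen ; the hidden size is a placeholder .

```python
import torch.nn as nn

class MorphProbe(nn.Module):
    """Diagnostic classifier: predicts one morphological feature value
    (e.g. Case=Nom vs. Case=Acc vs. ...) from a frozen parser representation,
    either the embedding x_i or the encoding h_i."""
    def __init__(self, repr_dim, n_labels, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_labels),
        )

    def forward(self, reps):
        # reps: (batch, repr_dim), detached from the trained parser
        return self.net(reps)   # logits over the feature's possible values
```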

Table
TABREF37
shows the results of morphological feature classification on Czech ; we found very similar results in German and Russian ( Appendix SECREF58 ) . The oracle embeddings have almost perfect accuracy — and this is just what we expect , since the representation only needs to preserve information from its input . The char - lstm embeddings perform well on number and gender , but less well on case . These results suggest that character - level models still struggle to learn case when given only the input text . Comparing the char - lstm with a baseline model that predicts the most frequent feature for each type in the training data , we observe that both show similar trends , even though the character model slightly outperforms the baseline .

The classification results from the encoding are particularly interesting : the oracle still performs very well on morphological case , but less well on other features , even though they appear in the input . In the character model , accuracy on morphological prediction also degrades in the encoding — except for case , where accuracy improves by 12 % .

These results make intuitive sense : representations learn to preserve information from their input that is useful for subsequent predictions . In our parsing model , morphological case is very useful for predicting dependency labels , and since it is present in the oracle 's input , it is passed almost completely intact through each representation layer . The character model , which must disambiguate case from context , draws as much additional information as it can from surrounding words through the LSTM encoder . But other features , and particularly whole feature bundles , are presumably less useful for parsing , so neither model preserves them with the same fidelity .

Our analysis indicates that case is important for parsing , so it is natural to ask : Can we improve the neural model by explicitly modeling case ? To answer this question , we ran a set of experiments , considering two ways to augment the char - lstm with case information : multitask learning
BIBREF16
and a pipeline model in which we augment the char - lstm model with either predicted or gold case . For example , we use p , i , z , z , a , Nom to represent pizza with nominative case . For MTL , we follow the setup of
BIBREF17
and
BIBREF18
. We increase the number of bi - LSTM layers from two to four and use the first two layers to predict morphological case , leaving the other two layers specific to the parser . For the pipeline model , we train a morphological tagger to predict morphological case ( Appendix SECREF56 ) . This tagger does not share parameters with the parser .
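The pipeline augmentation amounts to appending the ( gold or predicted ) case tag as one extra symbol at the end of the word 's character sequence , as in the pizza example above ; a trivial sketch of that input construction :

```python
def with_case(word, case_tag):
    """Pipeline-style input for the case-augmented char-lstm: the word's
    characters followed by its (gold or predicted) case tag, treated as one
    extra symbol in the character vocabulary."""
    return list(word) + [case_tag]

print(with_case("pizza", "Nom"))   # ['p', 'i', 'z', 'z', 'a', 'Nom']
```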

Table
TABREF40
summarizes the results on Czech , German , and Russian . We find that augmenting the char - lstm model with either oracle or predicted case improves its accuracy , although the effect differs across languages . The improvements from predicted case are interesting , since in non - neural parsers , predicted case usually harms accuracy
BIBREF19
. However , we note that our taggers use gold POS , which might help . The MTL models achieve similar or slightly better performance than the character - only models , suggesting that supplying case in this way is beneficial . Curiously , the MTL parser is worse than the pipeline parser , but the MTL case tagger is better than the pipeline case tagger ( Table
TABREF41
) . This indicates that the MTL model learns to encode case in the model 's representation , but does not learn to use it effectively for parsing . Finally , we observe that augmenting the char - lstm with either gold or predicted case improves the parsing performance for all languages , and indeed closes the performance gap with the full oracle , which has access to all morphological features . This is especially interesting , because it shows that carefully targeted linguistic analyses can improve accuracy as much as wholesale linguistic analysis .

The previous experiments condition their analysis on the dependent , but dependency is a relationship between dependents and heads . We also want to understand the importance of morphological features to the head . Which morphological features of the head are important to the oracle ?

To see which morphological features the oracle depends on when making predictions , we augmented our model with a gated attention mechanism following kuncoro - EtAl:2017 : EACLlong . Our new model attends to the morphological features of candidate head w j when computing its association with dependent w i ( Eq . EQREF5 ) , and morpheme representations are then scaled by their attention weights to produce a final representation .

Let f i1 ,⋯,f ik be the k morphological features of w i , and denote by 𝐟 i1 ,⋯,𝐟 ik their corresponding feature embeddings . As in § SECREF2 , 𝐡 i and 𝐡 j are the encodings of w i and w j , respectively . The morphological representation 𝐦 j of w j is : DISPLAYFORM0

where 𝐤 is a vector of attention weights : DISPLAYFORM0

The intuition is that dependent w i can choose which morphological features of w j are most important when deciding whether w j is its head . Note that this model is asymmetric : a word only attends to the morphological features of its ( single ) parent , and not its ( many ) children , which may have different functions .

We combine the morphological representation with the word 's encoding via a sigmoid gating mechanism . DISPLAYFORM0

where ⊙ denotes element - wise multiplication . The gating mechanism allows the model to choose between the computed word representation and the weighted morphological representations , since for some dependencies , morphological features of the head might not be important . In the final model , we replace Eq . EQREF5 and Eq . EQREF6 with the following : DISPLAYFORM0

The modified label prediction is : DISPLAYFORM0

where f is again a function to compute label score : DISPLAYFORM0
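The sketch below illustrates the gated morphological attention for a single dependent / candidate - head pair . The bilinear attention scorer and the convex combination inside the gate are assumptions on my part , since the paper 's exact equations are not reproduced in this extract .

```python
import torch
import torch.nn as nn

class GatedMorphAttention(nn.Module):
    """Sketch of gated attention over a candidate head's morphological features:
    the dependent h_i attends to the head's feature embeddings, the weighted sum
    gives m_j, and a sigmoid gate mixes m_j with the head encoding h_j."""
    def __init__(self, enc_dim=400, feat_dim=100):
        super().__init__()
        self.att = nn.Bilinear(enc_dim, feat_dim, 1)   # scores feature f_jt against h_i (assumed form)
        self.proj = nn.Linear(feat_dim, enc_dim)
        self.gate = nn.Linear(2 * enc_dim, enc_dim)

    def forward(self, h_i, h_j, feats_j):
        # h_i, h_j: (enc_dim,) encodings; feats_j: (k, feat_dim) feature embeddings of w_j
        k = feats_j.size(0)
        scores = self.att(h_i.unsqueeze(0).repeat(k, 1), feats_j).squeeze(-1)  # (k,)
        weights = torch.softmax(scores, dim=0)                                 # attention vector k
        m_j = self.proj(weights @ feats_j)                                     # weighted morphological repr.
        g = torch.sigmoid(self.gate(torch.cat([h_j, m_j], dim=-1)))            # element-wise gate
        return g * h_j + (1.0 - g) * m_j                                       # gated head representation

att = GatedMorphAttention()
out = att(torch.randn(400), torch.randn(400), torch.randn(5, 100))
print(out.shape)   # torch.Size([400])
```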

We trained our augmented model ( oracle - attn ) on Finnish , German , Czech , and Russian . Its accuracy is very similar to the oracle model ( Table
TABREF51
) , so we obtain a more interpretable model with no change to our main results .

Next , we look at the learned attention vectors to understand which morphological features are important , focusing on the core arguments : nominal subjects , objects , and indirect objects . Since our model knows the case of each dependent , this enables us to understand what features it seeks in potential heads for each case . For simplicity , we only report results for words where both head and label predictions are correct .

Figure
FIGREF52
shows how attention is distributed across multiple features of the head word . In Czech and Russian , we observe that the model attends to Gender and Number when the noun is in nominative case . This makes intuitive sense since these features often signal subject - verb agreement . As we saw in earlier experiments , these are features for which a character model can learn reliably good representations . For most other dependencies ( and all dependencies in German ) , Lemma is the most important feature , suggesting a strong reliance on the lexical semantics of nouns and verbs . However , we also notice that the model sometimes attends to features like Aspect , Polarity , and VerbForm — since these features are present only on verbs , we suspect that the model may simply use them as convenient signals that a word is a verb , and thus a likely head for a given noun .

Character - level models are effective because they can represent OOV words and orthographic regularities of words that are consistent with morphology . But they depend on context to disambiguate words , and for some words this context is insufficient . Case syncretism is a specific example that our analysis identified , but the main results in Table
TABREF20
hint at the possibility that different phenomena are at play in different languages .

While our results show that prior knowledge of morphology is important , they also show that it can be used in a targeted way : our character - level models improved markedly when we augmented them only with case . This suggests a pragmatic reality in the middle of the wide spectrum between pure machine learning from raw text input and linguistically - intensive modeling : our new models do n't need all prior linguistic knowledge , but they clearly benefit from some knowledge in addition to raw input . While we used a data - driven analysis to identify case syncretism as a problem for neural parsers , this result is consistent with previous linguistically - informed analyses
BIBREF20
,
BIBREF19
. We conclude that neural models can still benefit from linguistic analyses that target specific phenomena where annotation is likely to be useful .

Clara Vania is supported by the Indonesian Endowment Fund for Education ( LPDP ) , the Centre for Doctoral Training in Data Science , funded by the UK EPSRC ( grant EP / L016427/1 ) , and the University of Edinburgh . We would like to thank Yonatan Belinkov for the helpful discussion regarding morphological tagging experiments . We thank Sameer Bansal , Marco Damonte , Denis Emelin , Federico Fancellu , Sorcha Gilroy , Jonathan Mallinson , Joana Ribeiro , Naomi Saphra , Ida Szubert , Sabine Weber , and the anonymous reviewers for helpful discussion of this work and comments on previous drafts of the paper .

We adapt the parser 's encoder architecture for our morphological tagger . Following the notation in Section SECREF2 , each word w i is represented by its context - sensitive encoding , 𝐡 i ( Eq . EQREF3 ) . The encodings are then fed into a feed - forward neural network with two hidden layers — each with a ReLU non - linearity — and an output layer mapping to the morphological tags , followed by a softmax . We set the size of the hidden layers to 100 and use a dropout probability of 0.2 . We use the Adam optimizer with an initial learning rate of 0.001 and clip gradients to 5 . We train each model for 20 epochs with early stopping . The model is trained to minimize the cross - entropy loss .

Since we do not have additional data with the same annotations , we use the same UD dataset to train our tagger . To prevent overfitting , we only use the first 75 % of training data for training . After training the taggers , we predict the case for the training , development , and test sets and use them for dependency parsing .

Table
TABREF59
and
TABREF60
present morphological tagging results for German and Russian . We found that German and Russian show a similar pattern to Czech ( Table
TABREF37
) , where morphological case seems to be preserved in the encoder because it is useful for dependency parsing . In these three fusional languages , contextual information helps the character - level model predict the correct case . However , its performance still lags behind the oracle .

We observe a slightly different pattern in the Finnish results ( Table
TABREF61
) . The character embeddings achieve almost the same performance as the oracle embeddings . This result highlights the differences in morphological processes between Finnish and the fusional languages . We observe that the performance of the encoder representations is slightly worse than that of the embeddings .

Glossary :
element - wise multiplication
word representation
part - of - speech ( POS )
Pisˊmo ( “ letter ” ) : the subject in ( UID28 ) and the object in ( UID28 )
ADJ ( adjective ) , NOUN ( noun ) , PRON ( pronoun ) , PROPN ( proper noun )
nmod ( nominal modifiers ) , amod ( adjectival modifier )
classifier : a feedforward neural network
oracle - attn : the augmented model
character embeddings : achieve almost the same performance as the oracle embeddings
Modality-based Factorization for Multimodal Fusion 1811.12624 2018 W19-4331
sentiment analysis
data fusion
machine learning
many machine learning
multimodal fusion

Introduction and Related Works

Multimodal data fusion is a desirable method for many machine learning tasks where information is available from multiple source modalities , and it typically achieves better predictions . Multimodal integration can handle missing data from one or more modalities , and because some modalities can include noise , combining them can lead to more robust prediction . Moreover , since some information may not be visible in some modalities , or a single modality may not be powerful enough for a specific task , considering multiple modalities often improves performance
BIBREF0
,
BIBREF1
,
BIBREF2
.

For example , humans assign personality traits to each other , as well as to virtual characters by inferring personality from diverse cues , both behavioral and verbal , suggesting that a model to predict personality should take into account multiple modalities such as language , speech , and visual cues .

Multimodal fusion has a very broad range of applications , including audio - visual speech recognition
BIBREF0
, multimodal emotion recognition
BIBREF1
, medical
image analysis
BIBREF3
, multimedia event detection
BIBREF4
, personality trait detection
BIBREF2
, and sentiment analysis BIBREF5 .


According to the recent work by BIBREF6 , the techniques for multimodal fusion can be divided into early , late and hybrid approaches . Early approaches combine the multimodal features immediately by simply concatenating them
BIBREF7
. Late fusion combines the decisions for each modality ( either classification or regression ) by voting
BIBREF8
, averaging
BIBREF9
or weighted sum of the outputs of the learned models
BIBREF10
,
BIBREF9
.
The hybrid approach combines the predictions of early fusion with unimodal predictions .

It has been observed that early fusion ( feature level fusion ) concentrates on the inter - modality information rather than intra - modality information BIBREF5 due to the fact that inter - modality information can be more complicated at the feature level and dominates the learning process . On the other hand , these fusion approaches are not powerful enough to extract the inter - modality integration model and they are limited to some simple combining methods BIBREF5 .

Zadeh et al . BIBREF5 proposed combining n modalities by computing an n - dimensional tensor as the tensor product of the n different modality representations , followed by a flattening operation , in order to include first - order to n - th order inter - modality relations . This is then fed to a neural network model to make predictions . The authors show that their proposed method improves accuracy by considering both inter - modality and intra - modality relations . However , the generated representation has a very large dimension , which leads to a very large hidden layer and therefore a huge number of parameters . Recently ,
BIBREF11
proposed a factorization approach in order to achieve a factorized version of the weight matrix which leads to fewer parameters while maintaining model accuracy . They use a CANDECOMP / PARAFAC decomposition BIBREF12 ,
BIBREF13
which follows Eq . EQREF1 in order to decompose a tensor W∈ℝ d 1 ×...×d M into several 1-dimensional vectors w i m ∈ℝ d m : DISPLAYFORM0

where ⊗ is the outer product operator and the λ i are scalar weights that combine the rank - 1 decompositions . The authors ' approach uses the same factorization rate for all modalities , i.e. , r is shared across all the modalities , resulting in the same compression rate for every modality and no ability to vary compression rates between modalities . Previous studies have found that some modalities are more informative than others
BIBREF14
,
BIBREF2
, suggesting that allowing different compression rates for different modalities should improve performance .
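For concreteness , the CP ( CANDECOMP / PARAFAC ) reconstruction in Eq . EQREF1 can be written in a few lines of NumPy ; the tensor sizes in the toy check are arbitrary .

```python
import numpy as np

def cp_reconstruct(lambdas, factors):
    """Rebuild a tensor from a rank-r CP (CANDECOMP/PARAFAC) decomposition:
    W = sum_i lambda_i * (w_i^1 outer w_i^2 outer ... outer w_i^M).
    `factors[m]` is an (r, d_m) matrix whose i-th row is w_i^m."""
    shape = tuple(f.shape[1] for f in factors)
    W = np.zeros(shape)
    for i in range(len(lambdas)):
        term = lambdas[i]
        for f in factors:
            term = np.multiply.outer(term, f[i])   # accumulate the rank-1 outer product
        W += term
    return W

# Toy check: rank-2 decomposition of a 3 x 4 x 5 tensor.
rng = np.random.default_rng(0)
factors = [rng.normal(size=(2, d)) for d in (3, 4, 5)]
W = cp_reconstruct(np.array([0.5, 2.0]), factors)
print(W.shape)   # (3, 4, 5)
```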

Our method , Modality - based Redundancy Reduction multimodal Fusion ( MRRF ) , instead uses Tucker 's tensor decomposition ( see the Methodology section ) , which allows a different factorization rate for each modality and thus accommodates variations in the amount of useful information between modalities . Modality - specific factors are chosen by maximising performance on a validation set . Applying a modality - based factorization method removes the redundant information in the aforementioned high - order dependency structure and leads to fewer parameters with minimal information loss . Our method works as a regularizer , which leads to a less complicated model and reduces overfitting . In addition , our modality - based factorization approach helps to determine the amount of useful information in each modality for the task at hand .

We evaluate the performance of our approach on sentiment analysis , personality detection , and emotion recognition from audio , text , and video frames . The method reduces the number of parameters , so fewer training samples are required , which provides efficient training for smaller datasets and accelerates both training and prediction . Our experimental results demonstrate that the proposed approach makes notable improvements in terms of accuracy , mean average error ( MAE ) , correlation , and F 1 score , especially for applications with more complicated inter - modality relations .

We further study the effect of different factorization rates for different modalities . Our results on the importance of each modality for each task support previous results on the usefulness of each modality for personality recognition , emotion recognition , and sentiment analysis . Moreover , our results demonstrate that our factorization approach avoids the underfitting of very simple models and the overfitting of very large ones .

In the sequel , we first explain the notation used in this paper . We elaborate on the details of our proposed method in the methodology section . In the following section we go on to describe our experimental setup . In the results section , we compare the performance of MRRF and state - of - the - art baselines on three different datasets and discuss the effect of the factorization rate on each modality . Finally , we provide a brief conclusion of the approach and the results .

Notation

The operator ⊗ is the outer product operator : z 1 ⊗...⊗z M for z i ∈ℝ d i yields an M - dimensional tensor in ℝ d 1 ×...×d M . The operator × k , for a given k , is the k - mode product of a tensor R∈ℝ r 1 ×r 2 ×...×r M and a matrix W∈ℝ d k ×r k , written W× k R , which results in a tensor R ¯∈ℝ r 1 ×...×r k-1 ×d k ×r k+1 ×...×r M . This operator first flattens the tensor R into a matrix R ^∈ℝ r k ×r 1 ...r k-1 r k+1 ...r M . The next step is a simple matrix product WR ^∈ℝ d k ×r 1 ...r k-1 r k+1 ...r M . By unflattening the resulting matrix , we convert it back to a tensor in ℝ r 1 ×...×r k-1 ×d k ×r k+1 ×...×r M .
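A small NumPy sketch of the k - mode product exactly as defined here ( flatten along mode k , multiply by the matrix , and unflatten ) :

```python
import numpy as np

def mode_k_product(R, W, k):
    """k-mode product of a core tensor R (r_1 x ... x r_M) with a matrix
    W (d_k x r_k): unfold R along mode k, multiply, and fold back, yielding a
    tensor of shape (r_1, ..., r_{k-1}, d_k, r_{k+1}, ..., r_M)."""
    R_k = np.moveaxis(R, k, 0).reshape(R.shape[k], -1)     # r_k x (product of the other dims)
    out = W @ R_k                                          # d_k x (product of the other dims)
    new_shape = (W.shape[0],) + tuple(np.delete(R.shape, k))
    return np.moveaxis(out.reshape(new_shape), 0, k)

R = np.random.rand(2, 3, 4)
W = np.random.rand(7, 3)                  # maps mode 1 from size 3 to size 7
print(mode_k_product(R, W, 1).shape)      # (2, 7, 4)
```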

Methodology

We propose Modality - based Redundancy Reduction Fusion ( MRRF ) , a tensor fusion and factorisation method allowing for modality specific compression rates , combining the power of tensor fusion methods with a reduced parameter complexity .

Instead of simply concatenating the feature vectors of each modality , or using expensive fusion approaches such as the tensor fusion method by BIBREF5 , our aim is to use a compressed tensor fusion method .

We have used Tucker 's tensor decomposition method BIBREF15 ,
BIBREF16
which decomposes an M - dimensional tensor W∈ℝ d 1 ×d 2 ×...×d M into a core tensor R∈ℝ r 1 ×r 2 ×...×r M and M matrices W i ∈ℝ r i ×d i , with r i ≤d i , as can be seen in Eq .
EQREF2 . DISPLAYFORM0

The operator × k is a k - mode product of a tensor R∈ℝ r 1 ×r 2 ×...×r M and a matrix W∈ℝ d k ×r k as R× k W k , which results in a tensor R ¯∈ℝ r 1 ×...×r k-1 ×d k ×r k+1 ×...×r M .

For M modalities with representations D 1 , D 2 , ... and D M of size d 1 , d 2 , ... and d M , an M - modal tensor fusion approach as proposed by the authors of BIBREF5 leads to a tensor D=D 1 ⊗D 2 ⊗...⊗D M ∈ℝ d 1 ×d 2 ×...×d M . The authors proposed flattening the tensor layer in the deep network , which results in a loss of the information included in the tensor structure . In this paper , we propose to avoid the flattening and follow Eq . EQREF3 with weight tensor W∈ℝ h×d 1 ×d 2 ×...×d M and bias vector b∈ℝ h , which leads to an output layer of size h . DISPLAYFORM0

The above equation suffers from a large number of parameters ( O(∏ i=1 d i h) ) , which requires a large number of training samples and considerable time and memory , and easily overfits . In order to reduce the number of parameters , we propose to use Tucker 's tensor decomposition BIBREF15 ,
BIBREF16
as shown in Eq . EQREF4 , which works as a low - rank regularizer
BIBREF17
. DISPLAYFORM0

The non - diagonal core tensor R maintains inter - modality information after compression , unlike the factorization proposed by
BIBREF11
, which loses part of the inter - modality information .


After combining the core tensor R and the output matrix W M+1 , the decomposition in Eq . EQREF4 reduces to Eq . EQREF5 : DISPLAYFORM0

Substituting Eq . EQREF5 into Eq . EQREF3 leads to a factorized multimodal integration model . Figure
FIGREF6
presents an example of this process with two input modalities .

For a multimodal deep neural network architecture consisting of three separate channels for audio , text , and video , we can represent the method as seen in Fig .
FIGREF7
. It is worth mentioning that a simple outer product of the input features captures only the highest - order trimodal dependencies . In order to overcome this drawback , the input feature vectors for each modality are padded with 1 , so that the unimodal and bimodal dependencies are also obtained . Algorithm SECREF3 shows the whole MRRF algorithm .

Tensor Factorization Layer .

Input : n input modalities D 1 ,D 2 ,...,D n of sizes d 1 ,d 2 ,...,d n , respectively .

Initialization : factorization sizes r 1 ,r 2 ,...,r n for each modality .

1. Compute the tensor D=D 1 ⊗D 2 ⊗...⊗D n .
2. Generate the layer out=WD+b , where W=R ^× 1 W 1 × 2 ...× M W M , in order to transform the high - dimensional tensor D to an output of size h .
3. Use the Adam optimizer on the differentiable tensor factorization layer to find the unknown parameters W 1 ,W 2 ,...,W n ,R ^,b .

Output : factors of the weight tensor W : W 1 ,W 2 ,...,W n ,R .

It is worth noting that the factorization step is task dependent , is included in the deep network structure , and is learned during network training . Thus , for follow - up learning tasks , we would learn a new factorization specific to the task at hand , typically also estimating optimal compression ratios as described in the discussion section . In this process , any shared , helpful information is retained , as demonstrated by our results .
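One way to implement such a factorised fusion layer , sketched below in PyTorch , is to project each ( padded ) modality through its own factor matrix and contract the result with the non - diagonal core , so that the full weight tensor W is never materialised . The contraction order , initialisation , and dimensions are my own choices for illustration , not the authors ' released code .

```python
import torch
import torch.nn as nn

class MRRFLayer(nn.Module):
    """Sketch of a Tucker-factorised fusion layer in the spirit of MRRF: each
    modality has its own factor matrix (d_m + 1 -> r_m), and a non-diagonal core
    (already combined with the output matrix) maps the fused r_1 x ... x r_M
    block to h outputs."""
    def __init__(self, input_dims, ranks, h):
        super().__init__()
        # +1 because each modality vector is padded with a constant 1 so that
        # unimodal and bimodal terms survive in the fusion tensor
        self.factors = nn.ModuleList(
            nn.Linear(d + 1, r, bias=False) for d, r in zip(input_dims, ranks))
        self.core = nn.Parameter(torch.randn(h, *ranks) * 0.01)   # core combined with the output matrix
        self.bias = nn.Parameter(torch.zeros(h))

    def forward(self, modalities):
        # modalities: list of (batch, d_m) tensors
        batch = modalities[0].size(0)
        ones = modalities[0].new_ones(batch, 1)
        compressed = [f(torch.cat([m, ones], dim=1))
                      for f, m in zip(self.factors, modalities)]
        # outer product of the compressed modality vectors ...
        fused = compressed[0]
        for c in compressed[1:]:
            fused = fused.unsqueeze(-1) * c.view(batch, *([1] * (fused.dim() - 1)), -1)
        # ... then contract with the core to obtain the h-dimensional output
        out = fused.reshape(batch, -1) @ self.core.reshape(self.core.size(0), -1).t()
        return out + self.bias

layer = MRRFLayer(input_dims=[74, 47, 300], ranks=[4, 4, 8], h=64)
out = layer([torch.randn(2, 74), torch.randn(2, 47), torch.randn(2, 300)])
print(out.shape)   # torch.Size([2, 64])
```

Because every operation above is differentiable , the factor matrices and the core are learned end to end with the rest of the network , as in step 3 of the algorithm .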

Following our proposed approach , we have decomposed the trainable W tensor into four substantially smaller trainable matrices ( W 1 ,W 2 ,W 3 ,R ) , leading to O(∑ i=1 M (d i *r i )+∏ i=1 M r i *h) parameters .

For the feature level information of size d 1 , d 2 and d 3 for three different modalities , concat fusion ( CF ) leads to a layer size of O(∑ i=1 M d i ) and O(∑ i=1 M d i *h) parameters .

The tensor fusion approach ( TF ) , which applies the flattening method directly to Eq . EQREF3 , leads to a layer size of O(∏ i=1 M d i ) and O(∏ i=1 M d i *h) parameters . The LMF approach
BIBREF11
requires training O(∑ i=1 M r*h*d i ) parameters , where r is the rank used for all the modalities .

It can be seen that the number of parameters in the proposed approach is substantially smaller than in the simple tensor fusion ( TF ) approach and comparable to the LMF approach ; in practice the modality - specific ranks are small , which typically leads to far fewer parameters for MRRF than for LMF . The simple concatenation approach has fewer parameters still , but it performs worse because it is biased toward the intra - modality information representation rather than inter - modality information fusion .
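These complexity expressions are easy to compare numerically ; the helper below simply evaluates them , and the dimensions in the example call are illustrative only , not the paper 's experimental values .

```python
from math import prod

def param_counts(dims, ranks, h, r_lmf):
    """Fusion-layer parameter counts implied by the complexity expressions in
    the text (bias terms and constant factors ignored)."""
    return {
        "concat (CF)":        sum(dims) * h,                     # O(sum d_i * h)
        "tensor fusion (TF)": prod(dims) * h,                    # O(prod d_i * h)
        "LMF (shared r)":     sum(r_lmf * h * d for d in dims),  # O(sum r * h * d_i)
        "MRRF":               sum(d * r for d, r in zip(dims, ranks)) + prod(ranks) * h,
    }

# Illustrative dimensions only.
print(param_counts(dims=[74, 47, 300], ranks=[4, 4, 8], h=64, r_lmf=4))
```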

Datasets

We perform our experiments on the following multimodal datasets : CMU - MOSI BIBREF18 , POM
BIBREF19
, and IEMOCAP BIBREF20 for sentiment analysis , speaker trait recognition , and emotion recognition , respectively .
These tasks can be addressed by integrating both the verbal and nonverbal behaviors of the speakers .

The CMU - MOSI dataset is annotated on a seven - step scale from highly negative , negative , weakly negative , neutral , weakly positive , positive , to highly positive , which can be treated as a 7 - class classification problem with labels in the range [-3,+3] . The dataset contains 2199 annotated opinion utterances from 93 distinct YouTube movie reviews , each containing several opinion segments . Segments average 4.2 seconds in length .

The POM dataset is composed of 903 movie review videos . Each video is annotated with the following speaker traits : confident , passionate , voice pleasant , dominant , credible , vivid , expertise , entertaining , reserved , trusting , relaxed , outgoing , thorough , nervous , persuasive and humorous .

The IEMOCAP dataset is a collection of 151 videos of recorded dialogues , with 2 speakers per session for a total of 302 videos across the dataset . Each segment is annotated for the presence of 9 emotions ( angry , excited , fear , sad , surprised , frustrated , happy , disgust and neutral ) .

Each dataset consists of three modalities , namely language , visual , and acoustic . The visual and acoustic features are calculated by taking the average of their feature values over the word time interval BIBREF21 . In order to perform time alignment across modalities , the three modalities are aligned using P2FA BIBREF22 at the word level .

Pre - trained 300-dimensional GloVe word embeddings BIBREF21 were used to extract the language feature representations , which encode a sequence of the transcribed words into a sequence of vectors .

Visual features for each frame ( sampled at 30Hz ) are extracted using the Facet library , which includes 20 facial action units , 68 facial landmarks , head pose , gaze tracking , and HOG features BIBREF23 .

The COVAREP acoustic analysis framework BIBREF24 is used to extract low - level acoustic features , including 12 Mel frequency cepstral coefficients ( MFCCs ) , pitch , voiced / unvoiced segmentation , glottal source , peak slope , and maxima dispersion quotient features .

To evaluate model generalization , all datasets are split into training , validation , and test sets such that the splits are speaker independent , i.e. , no speakers from the training set are present in the test sets . Table
TABREF10
illustrates the data splits for all the datasets in detail .

Model Architecture

Similarly to
BIBREF11
, we use a simple model architecture for extracting the representations for each modality . We used three unimodal sub - embedding networks to extract representations z a , z v and z l for each modality , respectively . For acoustic and visual modalities , the sub - embedding network is a simple 2-layer feed - forward neural network , and for language , we used a long short - term memory ( LSTM ) network
BIBREF25
.


We tuned the layer sizes , the learning rates , and the factorization rates by grid search , selecting the configuration with the lowest mean average error on the validation set . We trained our model using the Adam optimizer
BIBREF26
. All models were implemented with Pytorch
BIBREF27
.

Results

We compared our proposed method with three baseline methods . Concat fusion ( CF ) BIBREF6 proposes a simple concatenation of the different modalities followed by a linear combination . The tensor fusion approach ( TF ) BIBREF5 computes a tensor including uni - modal , bi - modal , and tri - modal combination information . LMF
BIBREF11
is a tensor fusion method that performs tensor factorization using the same rank for all the modalities in order to reduce the number of parameters .
Our proposed method instead uses a different factorization rate for each modality .

In Table
TABREF12
, we present the mean average error ( MAE ) , the correlation between predicted and true scores , accuracy , and F1 score .
The proposed approach outperforms the baseline approaches on nearly all metrics , with marked improvements in Happy and Neutral recognition . The reason is that the inter - modality information for these emotions is more complicated than for the other emotions and requires a non - diagonal core tensor to extract it .

Investigating the Effect of Factorization Rate on Each Modality

In this section , we aim to investigate the amount of redundant information in each modality . To do this , after obtaining a tensor that combines all modalities at equivalent sizes , we factorize a single mode of the tensor while keeping the sizes of the other modalities fixed . By observing how performance changes with the factorization rate , one can estimate how much redundant information the corresponding modality contains relative to the other modalities .

The results can be seen in Fig .
FIGREF14
,
FIGREF15
and
FIGREF16
The horizontal axis is the ratio of the compressed size to the original size for a single modality ( factorization rate = compressed size / original size = r i / d i ) , and the vertical axis shows the accuracy for each modality .

Fig .
FIGREF14
shows the results for the IEMOCAP emotion recognition dataset in four columns , one for each of the four emotional categories : happy , angry , sad , and neutral . The first point that emerges clearly from the per - modality diagrams is that each modality has a different optimum compression rate ( the maximum accuracy is highlighted in each diagram ) , which means each contains a different amount of redundant information . In other words , high accuracy at a small factorization rate means that there is a lot of redundant information in that modality ; the information lost through this factorization can be compensated by the other modalities , thus avoiding a performance reduction . For example , looking at the sad category , we see higher optimal factor sizes than for the angry category , apart from the video modality , which is more informative for the angry category than for the sad category . This observation is supported by the results obtained in
BIBREF14
.

Moving on to the neutral category , the optimal factorization rates are smaller ( a higher compression rate ) for the video and language modalities . We know that the neutral category is very difficult to predict from these modalities in comparison to the other categories , which means that these modalities are not that informative for the neutral category . The happy category suffers considerably at smaller factors ( higher compression rates ) , which we interpret as indicating that all the modalities include some useful information for this category .

Fig .
FIGREF15
shows the results for the CMU - MOSI sentiment analysis dataset . For this dataset too , the first point that emerges clearly is that each modality has a different optimum compression rate , which means each modality contains a different amount of non - redundant information . In addition , we can see that the language modality cannot be compressed very much and includes little redundant information for this task .

Fig .
FIGREF16
shows the results for the POM personality trait recognition dataset . For this dataset as well , each of the modalities has a different optimum compression rate ( the maximum accuracy is highlighted in each of the diagrams ) , meaning they have differing levels of non - redundant information . Moreover , we can see that the visual modality includes more non - redundant information for personality recognition , which is supported by other recent publications
BIBREF2
.

In Figures
FIGREF14
,
FIGREF15
and
FIGREF16
, the accuracy curves under different compression rates first increase and then decrease . This phenomenon has a logical explanation . If the factor is too small , the resulting model is too simple and tends to underfit . On the other hand , if the factor is too large , the model is barely compressed ; it has many parameters and is prone to overfitting . This supports the view that our factorization method functions as a regularizer . Therefore , the accuracy is lower at the beginning and the end of the factorization spectrum .

Conclusion

We proposed a tensor fusion method for multimodal media analysis that obtains an ( M+1 ) - dimensional tensor to consider the high - order relationships between M input modalities and the output layer . Our modality - based factorization method removes the redundant information in this high - order dependency structure and leads to fewer parameters with minimal loss of information .

The Modality - based Redundancy Reduction multimodal Fusion ( MRRF ) works as a regularizer which leads to a less complicated model and avoids overfitting . In addition , a modality - based factorization approach helps to figure out the amount of non - redundant useful information in each individual modality through investigation of optimal modality - specific compression rates .

We have provided experimental results for combining acoustic , text , and visual modalities for three different tasks : sentiment analysis , personality trait recognition , and emotion recognition . We have seen that the modality - based tensor compression approach improves the results in comparison to the simple concatenation method , the tensor fusion method , and tensor fusion using the same factorization rank for all modalities as proposed in the LMF method . In other words , the proposed method enjoys the same benefits as the tensor fusion method while avoiding its large number of parameters , which would lead to a more complex model that needs many training samples and is prone to overfitting . We have evaluated our method by comparing the results with state - of - the - art methods , achieving a 1 % to 4 % improvement across multiple measures for the different tasks .

Moreover , we have investigated the effect of the compression rate on single modalities while fixing the other modalities helping to understand the amount of useful non - redundant information in each modality .

In the future , as the availability of data with more and more modalities increases , finding a trade - off between cost and performance , and making effective and efficient use of the available modalities , will be vital .

To be specific , does adding more modalities result in new information ?

If so , is the performance improvement worth the resulting computational and memory cost ?

Accordingly , exploring compression and factorization methods could help remove highly redundant modalities .

Glossary :
multimodal data fusion : a desirable method for many machine learning tasks where information is available from multiple source modalities
multimodal fusion : has a very broad range of applications , including audio - visual speech recognition BIBREF0 , multimodal emotion recognition BIBREF1 , and medical image analysis
late fusion : combines the decisions for each modality
modality - based factorization : removes the redundant information in the high - order dependency structure and leads to fewer parameters with minimal information loss
n - dimensional tensor : a tensor product of the n different modality representations followed by a flattening operation , including first - order to n - th order inter - modality relations
⊗ : the outer product operator ; λ i : scalar weights to combine rank - 1 decompositions
operator × k : the k - mode product of a tensor and a matrix
Tucker 's method : decomposes an M - dimensional tensor into a core tensor and M matrices
non - diagonal core R : maintains inter - modality information after compression , unlike the factorization of BIBREF11 , which loses part of it
initialization : factorization size for each modality
dataset : consists of three modalities , namely language , visual , and acoustic
visual / acoustic features : the average of feature values over the word time interval BIBREF21
language representations : encode a sequence of the transcribed words into a sequence of vectors
sub - embedding network : a simple feed - forward neural network for the acoustic and visual modalities , and an LSTM for language
LMF BIBREF11 : a tensor fusion method that performs tensor factorization using the same rank for all the modalities in order to reduce the number of parameters
horizontal axis : the ratio of compressed size over the original size for a single modality
vertical axis : the accuracy for each modality
MRRF : works as a regularizer that leads to a less complicated model and avoids overfitting
Chunk-Based Bi-Scale Decoder for Neural Machine Translation 1705.01452 2017 P17-2092
recurrent neural networks
lexical component
machine translation
lexical information
recurrent neural network

Introduction

Recent work on neural machine translation ( NMT ) proposes to adopt the encoder - decoder framework for machine translation BIBREF0 , BIBREF1 , BIBREF2 , which employs a recurrent neural network ( RNN ) encoder to model the source context information and an RNN decoder to generate translations ; this is significantly different from previous statistical machine translation systems BIBREF3 , BIBREF4 . This framework is then extended by an attention mechanism , which acquires source sentence context dynamically at different decoding steps BIBREF5 , BIBREF6 .

The decoder state stores translation information at different granularities , determining which segment should be expressed ( phrasal ) and which word should be generated ( lexical ) , respectively . However , because multi - word phrases and expressions are pervasive , the lexical component varies much faster than the phrasal one . For the generation of “ the French Republic " , the lexical component in the decoder changes three times , once for each word , but the phrasal component may change only once . The inconsistent varying speed of the two components may cause translation errors .

A typical NMT model generates target sentences at the word level , packing the phrasal and lexical information into one hidden state , which is not necessarily best for translation . Much previous work proposes to improve the NMT model by adopting fine - grained translation levels such as the character or sub - word levels , which can learn the intermediate information inside words BIBREF7 , BIBREF8 , BIBREF9 , BIBREF10 , BIBREF11 , BIBREF12 , BIBREF13 , BIBREF14 . However , high - level structures such as phrases , which are very useful for machine translation BIBREF15 , have not been explicitly explored in NMT .

We propose a chunk - based bi - scale decoder for NMT , which explicitly splits the lexical and phrasal components into different time - scales . The proposed model generates target words in a hierarchical way , deploying a standard word time - scale RNN ( lexical modeling ) on top of an additional chunk time - scale RNN ( phrasal modeling ) . At each step of decoding , our model first predicts a chunk state with a chunk attention , based on which multiple word states are generated without attention . The word state is updated at every step , while the chunk state is only updated when a chunk boundary is detected automatically by a boundary gate . In this way , we incorporate soft phrases into NMT , which makes the model flexible at capturing both global reordering of phrases and local translation inside phrases . Our model has the following benefits :

Glossary :
neural machine translation ( NMT ) : adopts the encoder - decoder framework for machine translation
recurrent neural network ( RNN ) : an RNN encoder models the source context information and an RNN decoder generates translations , significantly different from previous statistical machine translation systems
attention mechanism : acquires source sentence context dynamically at different decoding steps BIBREF5
word - level generation : packs the phrasal and lexical information in one hidden state
proposed model : generates target words in a hierarchical way
bi - scale RNN : deploys a standard word time - scale RNN ( lexical modeling ) on top of an additional chunk time - scale RNN ( phrasal modeling )
Neural Lattice Language Models 1803.05071 2018 Q18-1036
language models
lattice language
recurrent neural networks
language modeling
lattice language models
lattice language modeling
natural language processing

Introduction

Neural network models have recently contributed towards a great amount of progress in natural language processing . These models typically share a common backbone : recurrent neural networks ( RNN ) , which have proven themselves to be capable of tackling a variety of core natural language processing tasks
BIBREF0
,
BIBREF1
.
One such task is language modeling , in which we estimate a probability distribution over sequences of tokens that corresponds to observed sentences ( § SECREF2 ) . Neural language models , particularly models conditioned on a particular input , have many applications including in machine translation
BIBREF2
, abstractive summarization
BIBREF3
, and speech processing
BIBREF4
. Similarly , state - of - the - art language models are almost universally based on RNNs , particularly long short - term memory ( LSTM ) networks
BIBREF5
,
BIBREF6
,
BIBREF7
.


While powerful , LSTM language models usually do not explicitly model many commonly - accepted linguistic phenomena . As a result , standard models lack linguistically informed inductive biases , potentially limiting their accuracy , particularly in low - data scenarios
BIBREF8
,
BIBREF9
. In this work , we present a novel modification to the standard LSTM language modeling framework that allows us to incorporate some varieties of these linguistic intuitions seamlessly : neural lattice language models ( § SECREF9 ) . Neural lattice language models define a lattice over possible paths through a sentence , and maximize the marginal probability over all paths that lead to generating the reference sentence , as shown in Fig .
FIGREF2
. Depending on how we define these paths , we can incorporate different assumptions about how language should be modeled .

In the particular instantiations of neural lattice language models covered by this paper , we focus on two properties of language that could potentially be of use in language modeling : the existence of multi - word lexical units
BIBREF10
( § SECREF24 ) and polysemy
BIBREF11
( § SECREF31 ) .
Neural lattice language models allow the model to incorporate these aspects in an end - to - end fashion by simply adjusting the structure of the underlying lattices .

We run experiments to explore whether these modifications improve the performance of the model ( § SECREF5 ) . Additionally , we provide qualitative visualizations of the model to attempt to understand what types of multi - token phrases and polysemous embeddings have been learned .

Language Models

Consider a sequence X for which we want to calculate its probability . Assume we have a vocabulary from which we can select a unique list of |X| tokens x 1 ,x 2 ,...,x |X| such that X=[x 1 ;x 2 ;...;x |X| ] , i.e. the concatenation of the tokens ( with an appropriate delimiter ) . These tokens can be either on the character level
BIBREF12
,
BIBREF13
or word level
BIBREF6
,
BIBREF7
. Using the chain rule , language models generally factorize p(X) in the following way : DISPLAYFORM0

Note that this factorization is exact only in the case where the segmentation is unique . In character - level models , it is easy to see that this property is maintained , because each token is unique and non - overlapping . In word - level models , this also holds , because tokens are delimited by spaces , and no word contains a space .

Recurrent Neural Networks

Recurrent neural networks have emerged as the state - of - the - art approach to approximating p(X) . In particular , the LSTM cell
BIBREF0
is a specific RNN architecture which has been shown to be effective on many tasks , including language modeling
BIBREF14
,
BIBREF5
,
BIBREF7
,
BIBREF6
. LSTM language models recursively calculate the hidden and cell states ( h t and c t respectively ) given the input embedding e t-1 corresponding to token x t-1 : DISPLAYFORM0

then calculate the probability of the next token given the hidden state , generally by performing an affine transform parameterized by W and b , followed by a softmax : DISPLAYFORM0
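A minimal sketch of this recurrence and output layer ( dimensions are placeholders ) :

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Standard LSTM LM step: update (h_t, c_t) from the previous token's
    embedding, then apply an affine transform (W, b) and a softmax over the vocabulary."""
    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)   # W, b

    def step(self, prev_token, state=None):
        h, c = self.cell(self.emb(prev_token), state)
        return torch.log_softmax(self.out(h), dim=-1), (h, c)

lm = LSTMLanguageModel(vocab_size=10000)
logp, state = lm.step(torch.tensor([42]))   # log p(x_t | x_{<t}) for one hypothetical token id
print(logp.shape)   # torch.Size([1, 10000])
```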

Language Models with Ambiguous Segmentations

To reiterate , the standard formulation of language modeling in the previous section requires splitting sentence X into a unique set of tokens x 1 ,...,x |X| . Our proposed method generalizes the previous formulation to remove the requirement of uniqueness of segmentation , similar to that used in non - neural n -gram language models such as dupont1997lattice and goldwater2007distributional .

First , we define some terminology . We use the term “ token ” , designated by x i , to describe any indivisible item in our vocabulary that has no other vocabulary item as its constituent part . We use the term “ chunk ” , designated by k i or x i j , to describe a sequence of one or more tokens that represents a portion of the full string X , containing the unit tokens x i through x j : x i j =[x i ;x i+1 ;...;x j ] . We also refer to the “ token vocabulary ” , which is the subset of the vocabulary containing only tokens , and to the “ chunk vocabulary ” , which similarly contains all chunks .

Note that we can factorize the probability of any sequence of chunks K using the chain rule , in precisely the same way as sequences of tokens : DISPLAYFORM0

We can factorize the overall probability of a token list X in terms of its chunks by using the chain rule , and marginalizing over all segmentations . For any particular token list X , we define a set of valid segmentations 𝒮(X) , such that for every sequence s∈𝒮(X) , X=[x s 0 s 1 -1 ;x s 1 s 2 -1 ;...;x s |s|-1 s |s| ] . The factorization is : DISPLAYFORM0
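The factorization referenced above is again a placeholder ; a reconstruction consistent with the chunk and segmentation definitions in this section is :

```latex
p(X) \;=\; \sum_{s \in \mathcal{S}(X)} \prod_{t=1}^{|s|} p\left(k_t \mid k_1, \ldots, k_{t-1}\right),
\qquad \text{where } k_t = x_{s_{t-1}}^{\,s_t - 1}
```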



Note that , by definition , there exists a unique segmentation of X such that x 1 ,x 2 ,... are all tokens , in which case |s|=|X| . When only that one unique segmentation is allowed per X , 𝒮 contains only that one element , so the summation drops out , and therefore for standard character - level and word - level models , Eq . ( EQREF11 ) reduces to Eq . ( EQREF4 ) , as desired . However , for models that license multiple segmentations per X , computing this marginalization directly is generally intractable . For example , consider segmenting a sentence using a vocabulary containing all words and all 2-word expressions . The size of 𝒮 would grow exponentially with the number of words in X , meaning we would have to marginalize over trillions of unique segmentations for even modestly - sized sentences .

Lattice Language Models

To avoid this , it is possible to re - organize the computations in a lattice , which allows us to dramatically reduce the number of computations required
BIBREF15
,
BIBREF16
.

All segmentations of X can be expressed as the edges of paths through a lattice over token - level prefixes of X : x <1 , x <2 , ... , X . The infimum is the empty prefix x <1 ; the supremum is X ; an edge from prefix x <i to prefix x <j exists if and only if there exists a chunk x i j-1 in our chunk vocabulary such that [ x <i ; x i j-1 ] = x <j . Each path through the lattice from x <1 to X is a segmentation of X into the list of chunks on the traversed edges , as seen in Fig .
FIGREF2
.

The probability of a specific prefix p(x <j ) is calculated by marginalizing over all segmentations leading up to x j-1 : DISPLAYFORM0

where by definition s |s| =j . The key insight that allows us to calculate this efficiently is that this is a recursive formula : instead of marginalizing over all segmentations , we can marginalize over the immediate predecessor edges in the lattice , A j . Each item in A j is a location i ( =s t-1 ) , which indicates that the edge between prefix x <i and prefix x <j , corresponding to chunk x i j-1 , exists in the lattice . We can thus calculate p(x <j ) as DISPLAYFORM0
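A reconstruction of the recursion referenced above , consistent with the definition of A j , is :

```latex
p\left(x_{<j}\right) \;=\; \sum_{i \in A_j} p\left(x_{<i}\right)\, p\left(x_i^{\,j-1} \mid x_{<i}\right)
```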

Since X is the supremum prefix node , we can use this formula to calculate p(X) by setting j=|X|+1 . In order to do this , we need to calculate the probability of each of its |X| predecessors . Each of those takes up to |X| calculations , meaning that the computation for p(X) can be done in O ( |X| 2 ) time . If we can guarantee that each node will have a maximum number of incoming edges D so that |A j |≤D for all j , then this bound can be reduced to O ( D|X| ) time .
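As an illustration of this recursion , the following is a minimal sketch ( not the authors ' implementation ) of the forward pass over prefix probabilities . The helper chunk_prob(i, j, prefix_tokens) is hypothetical and stands in for the model 's probability of the chunk tokens[i:j] given the preceding tokens ; how that conditional is actually computed with a recurrent model is exactly the approximation problem discussed in the next section .

```python
def sentence_prob(tokens, chunk_vocab, chunk_prob, max_len):
    """Marginalize over all segmentations of `tokens` via the lattice recursion.

    prefix[j] holds the total probability of generating the first j tokens,
    summed over every segmentation of that prefix.
    """
    n = len(tokens)
    prefix = [0.0] * (n + 1)
    prefix[0] = 1.0  # empty prefix: the lattice infimum
    for j in range(1, n + 1):
        total = 0.0
        # predecessors A_j: start positions i such that tokens[i:j] is a chunk
        for i in range(max(0, j - max_len), j):
            if tuple(tokens[i:j]) in chunk_vocab:
                total += prefix[i] * chunk_prob(i, j, tokens[:i])
        prefix[j] = total
    return prefix[n]  # p(X): the supremum node
```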

The proposed technique is completely agnostic to the shape of the lattice , and Fig .
FIGREF17
illustrates several potential varieties of lattices . Depending on how the lattice is constructed , this approach can be useful in a variety of different contexts , two of which we discuss in § SECREF4 .

Neural Lattice Language Models

There is still one missing piece in our attempt to apply neural language models to lattices . Within our overall probability in Eq . ( EQREF14 ) , we must calculate the probability p(x i j ∣x <i ) of the next segment given the history . However , given that there are potentially an exponential number of paths through the lattice leading to x <i , this is not as straightforward as in the case where only one segmentation is possible . Previous work on lattice - based language models
BIBREF16
,
BIBREF15
utilized count - based n -gram models , which depend on only a limited historical context at each step making it possible to compute the marginal probabilities in an exact and efficient manner through dynamic programming . On the other hand , recurrent neural models depend on the entire context , causing them to lack this ability . Our primary technical contribution is therefore to describe several techniques for incorporating lattices into a neural framework with infinite context , by providing ways to approximate the hidden state of the recurrent neural net .

One approach to approximating the hidden state is the TreeLSTM framework described by tai2015improved . In the TreeLSTM formulation , new states are derived from multiple predecessors by simply summing the individual hidden and cell state vectors of each of them . For each predecessor location i∈A j , we first calculate the local hidden state h ˜ i and local cell state c ˜ i by combining the embedding e i j with the hidden state of the LSTM at x <i using the standard LSTM update function as in Eq . ( EQREF7 ) : h ˜ i , c ˜ i = LSTM( h i , c i , e i j , θ ) for i∈A j

We then sum the local hidden and cell states : h j = ∑ i∈A j h ˜ i , c j = ∑ i∈A j c ˜ i

This formulation is powerful , but comes at the cost of sacrificing the probabilistic interpretation of which paths are likely . Therefore , even if almost all of the probability mass comes through the “ true ” segmentation , the hidden state may still be heavily influenced by all of the “ bad ” segmentations as well .

Another approximation that has been proposed is to sample one predecessor state from all possible predecessors , as seen in chan2016latent . We can calculate the total probability that we reach some prefix x <j , and we know how much of this probability comes from each of its predecessors in the lattice , so we can construct a probability distribution M over predecessors in the lattice : DISPLAYFORM0

Therefore , one way to update the LSTM is to sample one predecessor x <i from the distribution M and simply set h j =h ˜ i and c j =c ˜ i . However , sampling is unstable and difficult to train : we found that the model tended to over - sample short tokens early on during training , and thus segmented every sentence into unigrams . This is similar to the outcome reported by chan2016latent , who accounted for it by incorporating an ϵ term encouraging exploration .

In another approach , which allows us to incorporate information from all predecessors while maintaining a probabilistic interpretation , we can utilize the probability distribution M to instead calculate the expected value of the hidden state : h j = 𝐄 x <i ∼M [ h ˜ i ] ( and likewise for c j )
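A minimal sketch of this weighted - expectation update ( the variant used by default in the experiments below ) , assuming a standard LSTM cell and with edge_mass standing in for the lattice forward probabilities p(x <i ) p(x i j-1 ∣x <i ) :

```python
import torch

def merge_expected(pred_states, embeddings, edge_mass, lstm_cell):
    """Expected-state merge: h_j = E_{i ~ M}[h~_i] and c_j = E_{i ~ M}[c~_i].

    pred_states : list of (h_i, c_i) tensors, one per predecessor prefix x_<i in A_j
    embeddings  : matching list of chunk embeddings labelling the incoming edges
    edge_mass   : list of scalar tensors, the probability mass arriving via each
                  edge; normalizing it gives the distribution M over predecessors
    lstm_cell   : a torch.nn.LSTMCell shared with the rest of the model
    """
    M = torch.stack(edge_mass)
    M = M / M.sum()  # distribution over predecessors in A_j
    local = [lstm_cell(e.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
             for (h, c), e in zip(pred_states, embeddings)]
    h_j = sum(m * h.squeeze(0) for m, (h, _) in zip(M, local))
    c_j = sum(m * c.squeeze(0) for m, (_, c) in zip(M, local))
    return h_j, c_j
```

Replacing the weighted average with an unweighted sum of the local states recovers the TreeLSTM - style approximation described above .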

The Gumbel - Softmax trick , or concrete distribution , described by jang2016categorical and maddison2016concrete , is a technique for incorporating discrete choices into differentiable neural computations . In this case , we can use it to select a predecessor . The Gumbel - Softmax trick works by taking advantage of the fact that adding Gumbel noise to the pre - softmax predecessor scores and then taking the argmax is equivalent to sampling from the probability distribution . By replacing the argmax with a softmax function scaled by a temperature τ , we can get this pseudo - sampled distribution through a fully differentiable computation : N(x <i ) = softmax ( ( log M(x <i ) + g i ) / τ ) , where g i is sampled from a Gumbel distribution



This new distribution can then be used to calculate the hidden state by taking a weighted average of the states of possible predecessors : h j = ∑ i∈A j N(x <i ) h ˜ i

When τ is large , the values of N(x <i ) are flattened out ; therefore , all the predecessor hidden states are summed with approximately equal weight , equivalent to the direct approximation ( § UID18 ) . On the other hand , when τ is small , the output distribution becomes extremely peaky , and one predecessor receives almost all of the weight . Each predecessor x <i has a chance of being selected equal to M(x <i ) , which makes it identical to ancestral sampling ( § UID20 ) . By slowly annealing the value of τ , we can smoothly interpolate between these two approaches , and end up with a probabilistic interpretation that avoids the instability of pure sampling - based approaches .
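A sketch of the Gumbel - Softmax weighting , again under the assumption that the unnormalized lattice forward probabilities are available ; the resulting weights N are then used exactly like M in the expectation sketch above :

```python
import torch

def gumbel_softmax_weights(edge_mass, tau):
    """Differentiable pseudo-sampling of a predecessor: returns weights N(x_<i).

    Annealing tau from large to small moves from summing all predecessor states
    with roughly equal weight toward selecting a single predecessor with
    probability M(x_<i).
    """
    M = torch.stack(edge_mass)
    M = M / M.sum()                                       # distribution over A_j
    gumbel = -torch.log(-torch.log(torch.rand_like(M)))   # Gumbel(0, 1) noise
    return torch.softmax((torch.log(M) + gumbel) / tau, dim=0)
```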

Instantiations of Neural Lattice LMs

In this section , we introduce two instantiations of neural lattice languages models aiming to capture features of language : the existence of coherent multi - token chunks , and the existence of polysemy .

Incorporating Multi-Token Phrases

Natural language phrases often demonstrate significant non - compositionality : for example , in English , the phrase “ rock and roll ” is a genre of music , but this meaning is not obtained by viewing the words in isolation . In word - level language modeling , the network is given each of these words as input , one at a time ; this means it must capture the idiomaticity in its hidden states , which is quite roundabout and potentially a waste of the limited parameters in a neural network model . A straightforward solution is to have an embedding for the entire multi - token phrase , and use this to input the entire phrase to the LSTM in a single timestep . However , it is also important that the model is able to decide whether the non - compositional representation is appropriate given the context : sometimes , “ rock ” is just a rock .

Additionally , by predicting multiple tokens in a single timestep , we are able to decrease the number of timesteps across which the gradient must travel , making it easier for information to be propagated across the sentence . This is even more useful in non - space - delimited languages such as Chinese , in which segmentation is non - trivial , but character - level modeling leads to many sentences being hundreds of tokens long .

There is also psycho - linguistic evidence which supports the fact that humans incorporate multi - token phrases into their mental lexicon . siyanova2011adding show that native speakers of a language have significantly reduced response time when processing idiomatic phrases , whether they are used in an idiomatic sense or not , while bannard2008stored show that children learning a language are better at speaking common phrases than uncommon ones . This evidence lends credence to the idea that multi - token lexical units are a useful tool for language modeling in humans , and so may also be useful in computational models .

The underlying lattices utilized in our multi - token phrase experiments are “ dense ” lattices : lattices where every edge ( below a certain length L ) is present ( Fig .
FIGREF17
, c ) . This is for two reasons . First , since every sequence of tokens is given an opportunity to be included in the path , all segmentations are candidates , which will potentially allow us to discover arbitrary types of segmentations without a prejudice towards a particular theory of which multi - token units we should be using . Second , using a dense lattice makes minibatching very straightforward by ensuring that the computation graphs for each sentence are identical . If the lattices were not dense , the lattices of various sentences in a minibatch could be different ; it then becomes necessary to either calculate a differently - shaped graph for every sentence , preventing minibatching and hurting training efficiency , or calculate and then mask out the missing edges , leading to wasted computation . Since only edges of length L or less are present , the maximum in - degree of any node in the lattice D is no greater than L , giving us the time bound O ( L|X| ) .

Storing an embedding for every possible multi - token chunk would require |V| L unique embeddings , which is intractable . Therefore , we construct our multi - token embeddings by merging compositional and non - compositional representations .

We first establish a priori a set of “ core ” chunk - level tokens that each have a dense embedding . In order to guarantee full coverage of sentences , we first add every unit - level token to this vocabulary , e.g. every word in the corpus for a word - level model . Following this , we also add the most frequent n - grams ( where 1 < n ≤ L ) . This ensures that the vast majority of sentences will have several longer chunks appear within them , and so will be able to take advantage of tokens at larger granularities .

However , the non - compositional embeddings above only account for a subset of all n -grams , so we additionally construct compositional embeddings for each chunk by running a BiLSTM encoder over the individual embeddings of each unit - level token within it
BIBREF17
.
In this way , we can create a unique embedding for every sequence of unit - level tokens .

We use this composition function on chunks regardless of whether they are assigned non - compositional embeddings or not , as even high - frequency chunks may display compositional properties . Thus , for every chunk , we compute the chunk embedding vector x i j by concatenating the compositional embedding with the non - compositional embedding if it exists , or otherwise with an < UNK > embedding .
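A PyTorch - style sketch of this chunk embedding construction ( the paper 's own implementation is in DyNet ; module names and dimensions here are assumptions ) :

```python
import torch
import torch.nn as nn

class ChunkEmbedder(nn.Module):
    """Chunk vector = [compositional BiLSTM summary ; non-compositional lookup or <UNK>]."""

    def __init__(self, num_tokens, chunk_vocab, dim):
        super().__init__()
        self.token_emb = nn.Embedding(num_tokens, dim)
        # dense non-compositional embeddings only for the "core" chunk vocabulary
        self.chunk_vocab = chunk_vocab  # dict: tuple of token ids -> row index
        self.chunk_emb = nn.Embedding(len(chunk_vocab) + 1, dim)  # last row = <UNK>
        self.unk_idx = len(chunk_vocab)
        self.encoder = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # compositional part: BiLSTM over the unit-level tokens of the chunk
        toks = self.token_emb(torch.tensor(token_ids)).unsqueeze(0)
        _, (h, _) = self.encoder(toks)
        compositional = torch.cat([h[0], h[1]], dim=-1).squeeze(0)
        # non-compositional part: lookup if the chunk is in the core vocabulary
        idx = self.chunk_vocab.get(tuple(token_ids), self.unk_idx)
        non_compositional = self.chunk_emb(torch.tensor(idx))
        return torch.cat([compositional, non_compositional], dim=-1)
```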

At each timestep , we want to use our LSTM hidden state h t to assign some probability mass to every chunk with length less than L . To do this , we follow merity2016pointer in creating a new “ sentinel ” token <𝑠> and adding it to our vocabulary . At each timestep , we first use our neural network to calculate a score for each chunk C in our vocabulary , including the sentinel token . We do a softmax across these scores to assign a probability p 𝑚𝑎𝑖𝑛 (C t+1 ∣h t ;θ) to every chunk in our vocabulary , and also to <𝑠> . For token sequences not represented in our chunk vocabulary , this probability p 𝑚𝑎𝑖𝑛 (C t+1 ∣h t ;θ)=0 .

Next , the probability mass assigned to the sentinel value , p 𝑚𝑎𝑖𝑛 (<𝑠>∣h t ;θ) , is distributed across all possible tokens sequences of length less than L , using another LSTM with parameters θ 𝑠𝑢𝑏 . Similar to jozefowicz2016exploring , this sub - LSTM is initialized by passing in the hidden state of the main lattice LSTM at that timestep . This gives us a probability for each sequence p 𝑠𝑢𝑏 (c 1 ,c 2 ,...,c L ∣h t ;θ 𝑠𝑢𝑏 ) .

The final formula for calculating the probability mass assigned to a specific chunk C is : p(C∣h t ;θ)=p 𝑚𝑎𝑖𝑛 (C∣h t ;θ)+p 𝑚𝑎𝑖𝑛 (<𝑠>∣h t ;θ)p 𝑠𝑢𝑏 (C∣h t ;θ 𝑠𝑢𝑏 )
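A small sketch of this mixture , assuming the logits over the core chunk vocabulary ( plus the sentinel ) and the sub - LSTM probability of the chunk 's token sequence have already been computed from h t :

```python
import torch
import torch.nn.functional as F

def chunk_probability(chunk, main_scores, sentinel_idx, chunk_to_idx, sub_prob):
    """Sentinel mixture: p(C|h_t) = p_main(C|h_t) + p_main(<s>|h_t) * p_sub(C|h_t).

    main_scores : logits over the core chunk vocabulary plus the sentinel <s>
    sub_prob    : probability of `chunk` under the sub-LSTM initialized from h_t
    """
    p_main = F.softmax(main_scores, dim=-1)
    # chunks outside the core vocabulary get zero mass from the main softmax
    p_core = p_main[chunk_to_idx[chunk]] if chunk in chunk_to_idx else torch.tensor(0.0)
    return p_core + p_main[sentinel_idx] * sub_prob
```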

Incorporating Polysemous Tokens

A second shortcoming of current language modeling approaches is that each word is associated with only one embedding . For highly polysemous words , a single embedding may be unable to represent all meanings effectively .

There has been past work in word embeddings which has shown that using multiple embeddings for each word is helpful in constructing a useful representation . athiwaratkun2017multimodal represented each word with a multimodal Gaussian distribution and demonstrated that embeddings of this form were able to outperform more standard skip - gram embeddings on word similarity and entailment tasks . Similarly , DBLP : journals / corr / ChenQJH15 incorporate standard skip - gram training into a Gaussian mixture framework and show that this improves performance on several word similarity benchmarks .

When a polysemous word is represented using only a single embedding in a language modeling task , the multimodal nature of the true embedding distribution may cause the resulting embedding to be both high - variance and skewed from the positions of each of the true modes . Thus , it is likely useful to represent each token with multiple embeddings when doing language modeling .

For our polysemy experiments , the underlying lattices are “ multilattices ” : lattices which are also multigraphs , and can have any number of edges between any given pair of nodes ( Fig .
FIGREF17
, d ) . Lattices set up in this manner allow us to incorporate multiple embeddings for each word . Within a single sentence , any pair of nodes corresponds to the start and end of a particular subsequence of the full sentence , and is thus associated with a specific token ; each edge between them is a unique embedding for that token . While many strategies for choosing the number of embeddings exist in the literature
BIBREF18
, in this work , we choose a number of embeddings E and assign that many embeddings to each word . This ensures that the maximum in - degree of any node in the lattice , D , is no greater than E , giving us the time bound O ( E|X| ) .
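A minimal sketch of how E parallel embeddings per word can be stored ; the resulting E edge vectors between consecutive nodes are then combined with the same predecessor - merging functions sketched earlier ( names and layout are assumptions , not the paper 's implementation ) :

```python
import torch
import torch.nn as nn

class MultiSenseEmbedding(nn.Module):
    """E embeddings per word: every token contributes E parallel lattice edges."""

    def __init__(self, vocab_size, dim, num_senses):
        super().__init__()
        self.num_senses = num_senses
        self.table = nn.Embedding(vocab_size * num_senses, dim)

    def edges(self, token_id):
        # one embedding per sense, labelling the E edges for this token,
        # so the in-degree of every lattice node is at most E
        idx = torch.arange(self.num_senses) + token_id * self.num_senses
        return self.table(idx)  # shape (E, dim)
```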

In this work , we do not explore models that include both chunk vocabularies and multiple embeddings . However , combining these two techniques , as well as exploring other , more complex lattice structures , is an interesting avenue for future work .

Data

We perform experiments on two languages : English and Chinese , which provide an interesting contrast in linguistic features .

In English , the most common benchmark for language modeling recently is the Penn Treebank , specifically the version preprocessed by mikolovptb . However , this corpus is limited by being relatively small , only containing approximately 45,000 sentences , which we found to be insufficient to effectively train lattice language models . Thus , we instead used the Billion Word Corpus
BIBREF19
. Past experiments on the BWC typically modeled every word without restricting the vocabulary , which results in a number of challenges regarding the modeling of open vocabularies that are orthogonal to this work . Thus , we create a preprocessed version of the data in the same manner as Mikolov , lowercasing the words , replacing numbers with < N > tokens , and < UNK > ing all words beyond the ten thousand most common . Additionally , we restricted the data set to only include sentences of length 50 or less , ensuring that large minibatches could fit in GPU memory . Our subsampled English corpus contained 29,869,166 sentences , of which 29,276,669 were used for training , 5,000 for validation , and 587,497 for testing . To validate that our methods scale up to larger language modeling scenarios , we also report a smaller set of large - scale experiments on the full billion word benchmark in Appendix A.

In Chinese , we ran experiments on a subset of the Chinese GigaWord corpus . Chinese is also particularly interesting because unlike English , it does not use spaces to delimit words , so segmentation is non - trivial . Therefore , we used a character - level language model for the baseline , and our lattice was composed of multi - character chunks . We used sentences from Guangming Daily , again < UNK > ing all but the 10,000 most common tokens and restricting the selected sentences to only include sentences of length 150 or less . Our subsampled Chinese corpus included 934,101 sentences for training , 5,000 for validation , and 30,547 for testing .

Main Experiments

We compare a baseline LSTM model , dense lattices of size 1 , 2 , and 3 , and a multilattice with 2 and 3 embeddings per word .

The implementation of our networks was done in DyNet
BIBREF20
. All LSTMs had 2 layers , each with hidden dimension of 200 . Variational dropout
BIBREF21
of .2 was used on the Chinese experiments , but hurt performance on the English data , so it was not used . The 10,000 word embeddings each had dimension 256 . For lattice models , chunk vocabularies were selected by taking the 10,000 words in the vocabulary and adding the most common 10,000 n -grams with 1 < n ≤ L . The weights on the final layer of the network were tied with the input embeddings , as done by
BIBREF14
,
BIBREF6
. In all lattice models , hidden states were computed using weighted expectation ( § UID22 ) unless mentioned otherwise . In multi - embedding models , embedding sizes were decreased so as to maintain the same total number of parameters . All models were trained using the Adam optimizer with a learning rate of .01 on a NVIDIA K80 GPU . The results can be seen in Table
TABREF38
and Table
TABREF39
.

In the multi - token phrase experiments , many additional parameters are accrued by the BiLSTM encoder and sub - LSTM predictive model , making them not strictly comparable to the baseline . To account for this , we include results for L=1 , which , like the baseline LSTM approach , fails to leverage multi - token phrases , but includes the same number of parameters as L=2 and L=3 .

In both the English and Chinese experiments , we see the same trend : increasing the maximum lattice size decreases the perplexity , and for L=2 and above , the neural lattice language model outperforms the baseline . Similarly , increasing the number of embeddings per word decreases the perplexity , and for E=2 and above , the multiple - embedding model outperforms the baseline .

Hidden State Calculation Experiments

We compare the various hidden - state calculation approaches discussed in Section SECREF16 on the English data using a lattice of size L=2 and dropout of .2 . These results can be seen in Table
TABREF42
.

For all hidden state calculation techniques , the neural lattice language models outperform the LSTM baseline . The ancestral sampling technique used by chan2016latent is worse than the others , which we found to be due to it getting stuck in a local minimum which represents almost everything as unigrams . There is only a small difference between the perplexities of the other techniques .

Discussion and Analysis

Neural lattice language models convincingly outperform an LSTM baseline on the task of language modeling . One interesting note is that in English , which is already tokenized into words and highly polysemous , utilizing multiple embeddings per word is more effective than including multi - word tokens . In contrast , in the experiments on the Chinese data , increasing the lattice size of the multi - character tokens is more important than increasing the number of embeddings per character . This corresponds to our intuition ; since Chinese is not tokenized to begin with , utilizing models that incorporate segmentation and compositionality of elementary units is very important for effective language modeling .

To calculate the probability of a sentence , the neural lattice language model implicitly marginalizes across latent segmentations . By inspecting the probabilities assigned to various edges of the lattice , we can visualize these segmentations , as is done in Fig .
FIGREF41
. The model successfully identifies bigrams which correspond to non - compositional compounds , like “ prime minister ” , and bigrams which correspond to compositional compounds , such as “ a quarter ” . Interestingly , this does not occur for all high - frequency bigrams ; it ignores those that are not inherently meaningful , such as “ < UNK > in ” , yielding qualitatively good phrases .

In the multiple - embedding experiments , it is possible to see which of the two embeddings of a word was assigned the higher probability for any specific test - set sentence . In order to visualize what types of meanings are assigned to each embedding , we select sentences in which one embedding is preferred , and look at the context in which the word is used . Several examples of this can be seen in Table
TABREF44
; it is clear from looking at these examples that the system does learn distinct embeddings for different senses of the word . What is interesting , however , is that it does not necessarily learn intuitive semantic meanings ; instead it tends to group the words by the context in which they appear . In some cases , like profile and edition , one of the two embeddings simply captures an idiosyncrasy of the training data .

Additionally , for some words , such as rodham in Table
TABREF44
, the system always prefers one embedding . This is promising , because it means that in future work it may be possible to further improve accuracy and training efficiency by assigning more embeddings to polysemous words , instead of assigning the same number of embeddings to all words .

Related Work

Past work that utilized lattices in neural models for natural language processing centers around using these lattices in the encoder portion of machine translation . DBLP : journals / corr / SuTXL16 utilized a variation of the Gated Recurrent Unit that operated over lattices , and preprocessed lattices over Chinese characters that allowed it to effectively encode multiple segmentations . Additionally , sperber2017neural proposed a variation of the TreeLSTM with the goal of creating an encoder over speech lattices in speech - to - text . Our work tackles language modeling rather than encoding , and thus addresses the issue of marginalization over the lattice .

Another recent work which marginalized over multiple paths through a sentence is ling2016latent . The authors tackle the problem of code generation , where some components of the code can be copied from the input , via a neural network . Our work expands on this by handling multi - word tokens as input to the neural network , rather than passing in one token at a time .

Neural lattice language models improve accuracy by helping the gradient flow over smaller paths , preventing vanishing gradients . Many hierarchical neural language models have been proposed with a similar objective koutnik2014clockwork , zhou2017chunk . Our work is distinguished from these by the use of latent token - level segmentations that capture meaning directly , rather than simply being high - level mechanisms to encourage gradient flow .

chan2016latent propose a model for predicting characters at multiple granularities in the decoder segment of a machine translation system . Our work expands on theirs by considering the entire lattice at once , rather than considering only a single path through the lattice via ancestral sampling . This allows us to train end - to - end without the model collapsing to a local minimum , with no exploration bonus needed . Additionally , we propose a broader class of models , including those incorporating polysemous words , and apply our model to the task of word - level language modeling , rather than character - level transcription .

Concurrently to this work , van2017multiscale have proposed a neural language model that can similarly handle multiple scales . Our work is differentiated in that it is more general : utilizing an open multi - token vocabulary , proposing multiple techniques for hidden state calculation , and handling polysemy using multi - embedding lattices .

Future Work

In the future , we would like to experiment with utilizing neural lattice language models in extrinsic evaluation , such as machine translation and speech recognition . Additionally , in the current model , the non - compositional embeddings must be selected a priori , and may be suboptimal . We are exploring techniques to store fixed embeddings dynamically , so that the non - compositional phrases can be selected as part of the end - to - end training .

Conclusion

In this work , we have introduced the idea of a neural lattice language model , which allows us to marginalize over all segmentations of a sentence in an end - to - end fashion . In our experiments on the Billion Word Corpus and Chinese GigaWord corpus , we demonstrated that the neural lattice language model beats an LSTM - based baseline at the task of language modeling , both when it is used to incorporate multiple - word phrases and multiple - embedding words . Qualitatively , we observed that the latent segmentations generated by the model correspond well to human intuition about multi - word phrases , and that the varying usage of words with multiple embeddings seems to also be sensible .

Large-Scale Experiments

To verify that our findings scale to state - of - the - art language models , we also compared a baseline model , dense lattices of size 1 and 2 , and a multilattice with 2 embeddings per word on the full byte - pair encoded Billion Word Corpus .

In this set of experiments , we take the full Billion Word Corpus , and apply byte - pair encoding as described by sennrich2015neural to construct a vocabulary of 10,000 sub - word tokens . Our model consists of three LSTM layers , each with 1500 hidden units . We train the model for a single epoch over the corpus , using the Adam optimizer with learning rate .0001 on a P100 GPU . We use a batch size of 40 , and variational dropout of 0.1 . The 10,000 sub - word embeddings each had dimension 600 . For lattice models , chunk vocabularies were selected by taking the 10,000 sub - words in the vocabulary and adding the most common 10,000 n -grams with 1 < n ≤ L . The weights on the final layer of the network were tied with the input embeddings , as done by press2016using , inan2016tying . In all lattice models , hidden states were computed using weighted expectation ( § UID22 ) . In multi - embedding models , embedding sizes were decreased so as to maintain the same total number of parameters .

Results of these experiments are in Table
TABREF45
. The performance of the baseline model is roughly on par with that of state - of - the - art models on this database ; differences can be explained by model size and hyperparameter tuning . The results show the same trend as the results of our main experiments , indicating that the performance gains shown by our smaller neural lattice language models generalize to the much larger datasets used in state - of - the - art systems .

Chunk Vocabulary Size

We compare a 2-lattice with a non - compositional chunk vocabulary of 10,000 phrases with a 2-lattice with a non - compositional chunk vocabulary of 20,000 phrases . The results can be seen in Table
TABREF46
. Doubling the number of non - compositional embeddings present decreases the perplexity , but only by a small amount . This is perhaps to be expected , given that doubling the number of embeddings corresponds to a large increase in the number of model parameters for phrases that may have less data with which to train them .

recurrent neural networks RNN a natural language processing
language modeling probability distribution over sequences of tokens that corresponds to observed sentences
lattice language models a lattice over possible paths through a sentence
probability the marginal over all paths that lead to generating the reference sentence
LSTM language models calculate the hidden and cell states
c t INLINEFORM1 and the input INLINEFORM3
“ chunk ” x i j describe a sequence of one or more tokens that represents a portion of the full string INLINEFORM3 , containing the unit tokens INLINEFORM4 through INLINEFORM5
“ token vocabulary ” the subset of the vocabulary containing only tokens , and chunk vocabulary contains all chunks
X the edges of paths through a lattice over token - level prefixes of
probability of a specific prefix calculated by marginalizing over all segmentations leading up to INLINEFORM1
location i =s t-1 indicates that the edge between prefix INLINEFORM5 and prefix INLINEFORM6 , corresponding to token INLINEFORM7 , exists in the lattice
distribution probability
argmax equivalent to sampling from the probability distribution
predecessor x a chance of being selected equal to INLINEFORM4 sampling
non - compositional of each unit - level
sub - LSTM initialized by passing in the hidden state of the main lattice LSTM at that timestep
embedding to maintain the same total number of parameters
models an LSTM baseline on task of language modeling
model identifies bigrams correspond to non - compositional compounds , like “ prime minister ” correspond to compositional compounds
embeddings captures an idiosyncrasy of training
Additionally a with the goal
neural lattice language model allows us over all of -
a non - compositional chunk vocabulary of 10,000 phrases with a non - compositional chunk vocabulary of 20,000 phrases
Context-Aware Prediction of Derivational Word-forms 1702.06675 2017 E17-2019
machine translation
derivational morphology
abstractive summarisation
derivational forms

Introduction

Understanding how new words are formed is a fundamental task in linguistics and language modelling , with significant implications for tasks with a generation component , such as abstractive summarisation and machine translation . In this paper we focus on modelling derivational morphology , to learn , e.g. , that the appropriate derivational form of the verb succeed is succession given the context As third in the line of word ... , but is success in The play was a great word .

English is broadly considered to be a morphologically impoverished language , and there are certainly many regularities in morphological patterns , e.g. , the common usage of -able to transform a verb into an adjective , or -ly to form an adverb from an adjective . However there is considerable subtlety in English derivational morphology , in the form of : ( a ) idiosyncratic derivations ; e.g. picturesque vs. beautiful vs. splendid as adjectival forms of the nouns picture , beauty and splendour , respectively ; ( b ) derivational generation in context , which requires the automatic determination of the part - of - speech ( POS ) of the stem and the likely POS of the word in context , and POS - specific derivational rules ; and ( c ) multiple derivational forms often exist for a given stem , and these must be selected between based on the context ( e.g. success and succession as nominal forms of success , as seen above ) . As such , there are many aspects that affect the choice of derivational transformation , including morphotactics , phonology , semantics or even etymological characteristics . Earlier works BIBREF0 analysed ambiguity of derivational suffixes themselves when the same suffix might present different semantics depending on the base form it is attached to ( cf . beautiful vs. cupful ) . Furthermore , as richardson1977lexical previously noted , even words with quite similar semantics and orthography such as horror and terror might have non - overlapping patterns : although we observe regularity in some common forms , for example , horrify and terrify , and horrible and terrible , nothing tells us why we observe terrorize and no instances of horrorize , or horrid , but not terrid .

In this paper , we propose the new task of predicting a derived form from its context and a base form . Our motivation in this research is primarily linguistic , i.e. we measure the degree to which it is possible to predict particular derivation forms from context . A similar task has been proposed in the context of studying how children master derivations
BIBREF1
.
In their work , children were asked to complete a sentence by choosing one of four possible derivations . Each derivation corresponded either to a noun , verb , adjective , or adverbial form . singson2000relation showed that children 's ability to recognize the correct form correlates with their reading ability . This observation confirms an earlier idea that orthographical regularities provide clearer clues to morphological transformations compared to phonological rules
BIBREF2
,
BIBREF3
, especially in languages such as English where grapheme - phoneme correspondences are opaque . For this reason we consider orthographic rather than phonological representations .

In our approach , we test how well models incorporating distributional semantics can capture derivational transformations . Deep learning models capable of learning real - valued word embeddings have been shown to perform well on a range of tasks , from language modelling
BIBREF4
to parsing BIBREF5 and machine translation
BIBREF6
. Recently , these models have also been successfully applied to morphological reinflection tasks BIBREF7 ,
BIBREF8
.

Derivational Morphology

Morphology , the linguistic study of the internal structure of words , has two main goals : ( 1 ) to describe the relation between different words in the lexicon ; and ( 2 ) to decompose words into morphemes , the smallest linguistic units bearing meaning . Morphology can be divided into two types : inflectional and derivational . Inflectional morphology is the set of processes through which the word form outwardly displays syntactic information , e.g. , verb tense . It follows that an inflectional affix typically neither changes the part - of - speech ( POS ) nor the semantics of the word . For example , the English verb to run takes various forms : run , runs and ran , all of which convey the concept “ moving by foot quickly ” , but appear in complementary syntactic contexts .

Derivation , on the other hand , deals with the formation of new words that have semantic shifts in meaning ( often including POS ) and is tightly intertwined with lexical semantics
BIBREF9
. Consider the example of the English noun discontentedness , which is derived from the adjective discontented . It is true that both words share a close semantic relationship , but the transformation is clearly more than a simple inflectional marking of syntax . Indeed , we can go one step further and define a chain of words content → contented → discontented → discontentedness .

In this work , we deal with the formation of deverbal nouns , i.e. , nouns that are formed from verbs . Common examples of this in English include agentives ( e.g. , explain → explainer ) , gerunds ( e.g. , explain → explaining ) , as well as other nominalisations ( e.g. , explain → explanation ) . Nominalisations differ in meaning from their base verbs to varying degrees , and a key focus of this study is the prediction of which form is most appropriate depending on the context , in terms of syntactic and semantic concordance . Our model is highly flexible and easily applicable to other related lexical problems .

Related Work

Although in the last few years many neural morphological models have been proposed , most of them have focused on inflectional morphology ( e.g. , see cotterell - EtAl:2016 : SIGMORPHON ) . Focusing on derivational processes , there are three main directions of research . The first deals with the evaluation of word embeddings either using a word analogy task
BIBREF10
or binary relation type classification
BIBREF11
. In this context , it has been shown that , unlike inflectional morphology , most derivational relations can not be as easily captured using distributional methods . Researchers working on the second type of task attempt to predict derived forms using the embedding of its corresponding base form and a vector encoding a “ derivational ” shift . guevara2011computing notes that derivational affixes can be modelled as a geometrical function over the vectors of the base forms . On the other hand , lazaridou2013compositional and DBLP : journals / corr / CotterellS17 represent derivational affixes as vectors and investigate various functions to combine them with base forms . kisselew2015obtaining and padopredictability extend this line of research to model derivational morphology in German . This work demonstrates that various factors such as part of speech , semantic regularity and argument structure BIBREF12 influence the predictability of a derived word . The third area of research focuses on the analysis of derivationally complex forms , which differs from this study in that we focus on generation . The goal of this line of work is to produce a canonicalised segmentation of an input word into its constituent morphs , e.g. , unhappiness un + happy + ness
BIBREF13
,
BIBREF14
. Note that the orthographic change y → i has been reversed .

Dataset

As the starting point for the construction of our dataset , we used the CELEX English dataset
BIBREF15
. We extracted verb – noun lemma pairs from CELEX , covering 24 different nominalisational suffixes and 1,456 base lemmas . Suffixes only occurring in 5 or fewer lemma pairs mainly corresponded to loan words and consequently were filtered out . We augmented this dataset with verb – verb pairs , one for each verb present in the verb – noun pairs , to capture the case of a verbal form being appropriate for the given context . For each noun and verb lemma , we generated all their inflections , and searched for sentential contexts of each inflected token in a pre - tokenised dump of English Wikipedia . To dampen the effect of high - frequency words , we applied a heuristic log function threshold which is basically a weighted logarithm of the number of the contexts . The final dataset contains 3,079 unique lemma pairs represented in 107,041 contextual instances .

Experiments

In this paper we model derivational morphology as a prediction task , formulated as follows . We take sentences containing a derivational form of a given lemma , then obscure the derivational form by replacing it with its base form lemma . The system must then predict the original ( derivational ) form , which may make use of the sentential context . System predictions are judged correct if they exactly match the original derived form .

Baseline

As a baseline we considered a trigram model with modified Kneser - Ney smoothing , trained on the training dataset . Each sentence in the testing data was augmented with a set of confabulated sentences , in which we replaced the target word with its other derivations or its base form . Unlike the general task , where we generate word forms as character sequences , here we use a set of known inflected forms for each lemma ( from the training data ) . We then used the language model to score the collections of test sentences , selected the variant with the highest language model score , and evaluated the accuracy of selecting the original word form .
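A sketch of this reranking procedure ; lm_score is a hypothetical interface standing in for the trained Kneser - Ney trigram model , not a specific library call :

```python
def predict_derivation(sentence_tokens, target_index, candidate_forms, lm_score):
    """Replace the target slot with each known form and keep the LM-preferred one.

    lm_score(tokens) -> float : log-probability of the token sequence under the
    trigram model with modified Kneser-Ney smoothing (trained elsewhere).
    """
    best_form, best_score = None, float("-inf")
    for form in candidate_forms:  # the base form plus its known derived forms
        candidate = list(sentence_tokens)
        candidate[target_index] = form
        score = lm_score(candidate)
        if score > best_score:
            best_form, best_score = form, score
    return best_form
```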

Encoder–Decoder Model

We propose an encoder – decoder model . The encoder combines the left and the right contexts as well as a character - level base form representation : DISPLAYFORM0

where 𝐡 left → , 𝐡 left ← , 𝐡 right → , 𝐡 right ← , 𝐡 base → , 𝐡 base ← correspond to the last hidden states of an LSTM BIBREF16 over left and right contexts and the character - level representation of the base form ( in each case , applied forwards and backwards ) , respectively ; H∈ℝ [h×l×1.5,h×l×6] is a weight matrix , and 𝐛 h ∈ℝ [h×l×1.5] is a bias term . [;] denotes a vector concatenation operation , h is the hidden state dimensionality , and l is the number of layers .

Next we add an extra affine transformation , 𝐨=T·𝐭+𝐛 o , where T∈ℝ [h×l×1.5,h×l] and 𝐛 o ∈ℝ [h×l] , then 𝐨 is then fed into the decoder : DISPLAYFORM0

where 𝐜 j is an embedding of the j -th character of the derivation , 𝐥 j+1 is an embedding of the corresponding base character , B,S,R are weight matrices , and 𝐛 d is a bias term .

We now elaborate on the design choices behind the model architecture which have been tailored to our task . We supply the model with the l j+1 character prefix of the base word to enable a copying mechanism , to bias the model to generate a derived form that is morphologically - related to the base verb . In most cases , the derived form is longer than its stem , and accordingly , when we reach the end of the base form , we continue to input an end - of - word symbol . We provide the model with the context vector 𝐨 at each decoding step . It has been previously shown BIBREF17 that this yields better results than other means of incorporation . Finally , we use max pooling to enable the model to switch between copying of a stem or producing a new character .
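A highly simplified PyTorch - style sketch of this information flow . Layer counts , nonlinearities and the omission of max pooling are simplifications of the description above , not the authors ' implementation ; the point is the wiring : the left context , right context and character - level base form are encoded , concatenated and projected to the context vector 𝐨 , which is re - fed at every decoding step together with the previous output character and the aligned base - form character ( the copy bias ) .

```python
import torch
import torch.nn as nn

class DerivationSketch(nn.Module):
    """Sketch of the context + base-form encoder and a single decoder step."""

    def __init__(self, word_dim, char_vocab, char_dim, hidden):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.ctx_lstm = nn.LSTM(word_dim, hidden, bidirectional=True, batch_first=True)
        self.base_lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.to_t = nn.Linear(6 * hidden, 3 * hidden)   # plays the role of H, b_h
        self.to_o = nn.Linear(3 * hidden, 2 * hidden)   # plays the role of T, b_o
        self.dec_cell = nn.LSTMCell(2 * hidden + 2 * char_dim, 2 * hidden)
        self.out = nn.Linear(2 * hidden, char_vocab)

    def encode(self, left_ctx, right_ctx, base_chars):
        # last forward/backward states of the three BiLSTM encoders, concatenated
        _, (h_l, _) = self.ctx_lstm(left_ctx)
        _, (h_r, _) = self.ctx_lstm(right_ctx)
        _, (h_b, _) = self.base_lstm(self.char_emb(base_chars))
        t = torch.tanh(self.to_t(torch.cat(
            [h_l[0], h_l[1], h_r[0], h_r[1], h_b[0], h_b[1]], dim=-1)))
        return self.to_o(t)  # the context vector o

    def decode_step(self, o, prev_char, base_char, state):
        # o is provided at every step; base_char supplies the copy bias
        x = torch.cat([o, self.char_emb(prev_char), self.char_emb(base_char)], dim=-1)
        h, c = self.dec_cell(x, state)
        return torch.log_softmax(self.out(h), dim=-1), (h, c)
```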

Settings

We used a 3-layer bidirectional LSTM network , with hidden dimensionality h for both context and base - form stem states of 100 , and character embedding 𝐜 j of 100 . We used pre - trained 300-dimensional Google News word embeddings
BIBREF4
,
BIBREF18
. During the training of the model , we keep the word embeddings fixed , for greater applicability to unseen test instances . All tokens that did n't appear in this set were replaced with UNK sentinel tokens . The network was trained using SGD with momentum until convergence .

Results

With the encoder – decoder model , we experimented with the full architecture as described in Section SECREF6 ( “ biLSTM+CTX+BS ” ) , as well as several variations , namely : excluding context information ( “ biLSTM+BS ” ) , and excluding the bidirectional stem ( “ biLSTM+CTX ” ) . We also investigated how much improvement we can get from knowing the POS tag of the derived form , by presenting it explicitly to the model as extra conditioning context ( “ biLSTM+CTX+BS+POS ” ) . The main motivation for this relates to gerunds , where without the POS , the model often overgenerates nominalisations . We then tried a single - directional context representation , by using only the last hidden states , i.e. , 𝐡 left → and 𝐡 right ← , corresponding to the words to the immediate left and right of the wordform to be predicted ( “ LSTM+CTX+BS+POS ” ) .

We ran two experiments : first , a shared lexicon experiment , where every stem in the test data was present in the training data ; and second , using a split lexicon , where every stem in the test data was unseen in the training data . The results are presented in Table
TABREF10
, and show that : ( 1 ) context has a strong impact on results , particularly in the shared lexicon case ; ( 2 ) there is strong complementarity between the context and character representations , particularly in the split lexicon case ; and ( 3 ) POS information is particularly helpful in the split lexicon case . Note that most of the models significantly outperform our baseline under shared lexicon setting . The baseline model does n't support the split lexicon setting ( as the derivational forms of interest , by definition , do n't occur in the training data ) , so we can not generate results in this setting .

Error Analysis

We carried out error analysis over the produced forms of the LSTM+CTX+BS+POS model . First , the model sometimes struggles to differentiate between nominal suffixes : in some cases it puts an agentive suffix ( -er or -or ) in contexts where a non - agentive nominalisation ( e.g. -ation or -ment ) is appropriate . As an illustration of this , Figure
FIGREF15
is a t - SNE projection of the context representations for simulate vs. simulator vs. simulation , showing that the different nominal forms have strong overlap . Secondly , although the model learns whether to copy or produce a new symbol well , some forms are spelled incorrectly . Examples of this are studint , studion or even studyant rather than student as the agentive nominalisation of study . Here , the issue is opaqueness in the etymology , with student being borrowed from the Old French estudiant . For transformations which are native to English , for example , -ate -ation , the model is much more accurate . Table
TABREF16
shows recall values achieved for various suffix types . We do not present precision since it could not be reliably estimated without extensive manual analysis .

In the split lexicon setting , the model sometimes misses double consonants at the end of words , producing wraper and winer , and is biased towards generating the most productive suffixes . An example of the latter is stoption in place of stoppage . We also studied how much the training size affects the model 's accuracy by reducing the data from 60,000 to 1,000 instances ( maintaining a balance over lemmas ) . Interestingly , we did not observe a significant reduction in accuracy . Finally , note that under the split lexicon setting , the model is agnostic of existing derivations , sometimes over - generating possible forms . A nice illustration of that is trailation , trailment and trailer all being produced in the contexts of trailer . In other cases , the model might miss some of the derivations , for instance , predicting only government in the contexts of governance and government . We hypothesize that this is due either to very subtle differences in their contexts , or to the higher productivity of -ment .

Finally , we experimented with some nonsense stems , overwriting sentential instances of transcribe to generate context - sensitive derivational forms . Table
TABREF17
presents the nonsense stems , the correct form of transcribe for a given context , and the predicted derivational form of the nonsense word . Note that the base form is used correctly ( top row ) for three of the four nonsense words , and that despite the wide variety of output forms , they resemble plausible words in English . By looking at a larger slice of the data , we observed some regularities . For instance , fapery was mainly produced in the contexts of transcript whereas fapication was more related to transcription . Table
TABREF17
also shows that some of the stems appear to be more productive than others .

Conclusions and Future Work

We investigated the novel task of context - sensitive derivation prediction for English , and proposed an encoder – decoder model to generate nominalisations . Our best model achieved an accuracy of 90 % on a shared lexicon , and 66 % on a split lexicon . This suggests that there is regularity in derivational processes and , indeed , in many cases the context is indicative . As we mentioned earlier , there are still many open questions which we leave for future work . Further , we plan to scale to other languages and augment our dataset with Wiktionary data , to realise much greater coverage and variety of derivational forms .

Acknowledgments

We would like to thank all reviewers for their valuable comments and suggestions . The second author was supported by a DAAD Long - Term Research Grant and an NDSEG fellowship . This research was supported in part by the Australian Research Council .

vs. beautiful vs. splendid adjectival the nouns picture , beauty and splendour , )
- - part of
similar task proposed in the context of studying how children master derivations BIBREF1
Morphology two types inflectional and derivational
main to describe the relation between different words in the lexicon ; and the smallest linguistic units bearing meaning
morphology the set of processes through which the word form outwardly displays syntactic information , e.g. , verb tense
- speech of
task a prediction
[;] a vector concatenation operation
h the hidden state dimensionality
l the number of layers
𝐜 j an embedding of the INLINEFORM1 -th character of the derivation
𝐥 j+1 an embedding of the corresponding base character , INLINEFORM3 are weight matrices a bias term
SECREF6 excluding context information
Deep learning for extracting protein-protein interactions from biomedical literature 1706.01556 2017 W17-2304
syntactic sentence structure
machine learning
sentence syntactic structure.
interaction relations
convolutional neural networks
convolutional neural network
natural language processing

Introduction

With the growing amount of biomedical information available in textual form , there has been considerable interest in applying natural language processing ( NLP ) techniques and machine learning ( ML ) methods to the biomedical literature
BIBREF0
, BIBREF1 ,
BIBREF2
,
BIBREF3
. One of the most important tasks is to extract protein - protein interaction relations BIBREF4 .

Protein - protein interaction ( PPI ) extraction is a task to identify interaction relations between protein entities mentioned within a document . While PPI relations can span over sentences and even cross documents , current works mostly focus on PPI in individual sentences
BIBREF5
,
BIBREF6
. For example , “ ARFTS ” and “ XIAP - BIR3 ” are in a PPI relation in the sentence “ ARFTS PROT1 specifically binds to a distinct domain in XIAP - BIR3 PROT2 ” .

Recently , deep learning methods have achieved notable results in various NLP tasks
BIBREF7
. For PPI extraction , convolutional neural networks ( CNN ) have been adopted and applied effectively
BIBREF8
,
BIBREF9
, BIBREF10 .
Compared with traditional supervised ML methods , the CNN model is more generalizable and does not require tedious feature engineering efforts . However , how to incorporate linguistic and semantic information into the CNN model remains an open question . Thus previous CNN - based methods have not achieved state - of - the - art performance in the PPI task
BIBREF11
.

In this paper , we propose a multichannel dependency - based convolutional neural network , McDepCNN , to provide a new way to model the syntactic sentence structure in CNN models . Compared with the widely used one - hot CNN model ( where , e.g. , the shortest - path information is first transformed into a binary vector that is zero in all positions except at the shortest - path 's index , and then fed to the CNN ) , McDepCNN utilizes a separate channel to capture the dependencies of the sentence syntactic structure .

To assess McDepCNN , we evaluated our model on two benchmarking PPI corpora , AIMed
BIBREF12
and BioInfer
BIBREF13
. Our results show that McDepCNN performs better than the state - of - the - art feature- and kernel - based methods .

We further examined McDepCNN in two experimental settings : a cross - corpus evaluation and an evaluation on a subset of “ difficult ” PPI instances previously reported
BIBREF14
. Our results suggest that McDepCNN is more generalizable and capable of capturing long distance information than kernel methods .

The rest of the manuscript is organized as follows . We first present related work . Then , we describe our model in Section SECREF3 , followed by an extensive evaluation and discussion in Section SECREF4 . We conclude in the last section .

Related work

From the ML perspective , we formulate the PPI task as a binary classification problem where discriminative classifiers are trained with a set of positive and negative relation instances . In the last decade , ML - based methods for the PPI task have been dominated by two main types : feature - based and kernel - based methods . The common characteristic of these methods is to transform relation instances into a set of features or rich structural representations like trees or graphs , by leveraging linguistic analysis and knowledge resources . Then a discriminative classifier is used , such as support vector machines
BIBREF15
or conditional random fields BIBREF16 .

While these methods allow the relation extraction systems to inherit the knowledge discovered by the NLP community for the pre - processing tasks , they are highly dependent on feature engineering BIBREF17 ,
BIBREF18
,
BIBREF19
,
BIBREF20
. The difficulty with feature - based methods is that data can not always be easily represented by explicit feature vectors .

Since natural language processing applications involve structured representations of the input data , deriving good features is difficult , time - consuming , and requires expert knowledge . Kernel - based methods attempt to solve this problem by implicitly calculating dot products for every pair of examples
BIBREF21
,
BIBREF22
, BIBREF23 , BIBREF24 , BIBREF25 . Instead of extracting feature vectors from examples , they apply a similarity function between examples and use a discriminative method to label new examples
BIBREF6
. However , this method also requires manual effort to design a similarity function which can not only encode linguistic and semantic information in the complex structures but also successfully discriminate between examples . Kernel - based methods are also criticized for having higher computational complexity
BIBREF26
.

Convolutional neural networks ( CNN ) have recently achieved promising results in the PPI task
BIBREF8
, BIBREF10 .
CNNs are a type of feed - forward artificial neural network whose layers are formed by a convolution operation followed by a pooling operation BIBREF27 . Unlike feature- and kernel - based methods , which have been well studied for decades , few studies have investigated how to incorporate syntactic and semantic information into the CNN model . To this end , we propose a neural network model that makes use of automatically learned features ( from different CNN layers ) together with manually crafted ones ( via domain knowledge ) , such as words , part - of - speech tags , chunks , named entities , and the dependency graph of sentences . Such a combination in feature engineering has also been shown to be effective in other NLP tasks ( e.g.
BIBREF28
) .

Furthermore , we propose a multichannel CNN , a model that has been suggested to capture different “ views ” of the input data . In image processing , BIBREF29 applied different RGB ( red , green , blue ) channels to color images . In NLP research , such models often use separate channels for different word embeddings
BIBREF30
,
BIBREF31
. For example , one could have separate channels for different word embeddings
BIBREF9
, or have one channel that is kept static throughout training and the other that is fine - tuned via backpropagation BIBREF32 . Unlike these studies , we utilize the head of the word in a sentence as a separate channel .

Model Architecture Overview

Figure
FIGREF2
illustrates the overview of our model , which takes a complete sentence with mentioned entities as input and outputs a probability vector ( two elements ) corresponding to whether there is a relation between the two entities . Our model mainly consists of three layers : a multichannel embedding layer , a convolution layer , and a fully - connected layer .

Embedding Layer

In our model , as shown in Figure
FIGREF2
, each word in a sentence is represented by concatenating its word embedding , part - of - speech , chunk , named entity , dependency , and position features .

Word embedding is a language modeling technique in which words from the vocabulary are mapped to vectors of real numbers . It has been shown to boost performance in NLP tasks . In this paper , we used pre - trained word embedding vectors
BIBREF33
learned on PubMed articles using the word2vec tool
BIBREF34
. The dimensionality of word vectors is 200 .

We used the part - of - speech ( POS ) feature to extend the word embedding . Similar to
BIBREF35
, we divided POS into eight groups . Then each group is mapped to an eight - bit binary vector . In this way , the dimensionality of the POS feature is 8 .

We used the chunk tags obtained from Genia Tagger for each word
BIBREF36
. We encoded the chunk features using a one - hot scheme . The dimensionality of chunk tags is 18 .

To generalize the model , we used four types of named entity encodings for each word . The named entities were provided as input in the task data . In one PPI instance , the types of the two proteins of interest are PROT1 and PROT2 , respectively . The type of other proteins is PROT , and the type of other words is O. If a protein mention spans multiple words , we marked each word with the same type ( we did not use a scheme such as IOB ) . The dimensionality of the named entity feature is thus 4 .

To add dependency information for each word , we used the label of the “ incoming ” edge of that word in the dependency graph . Taking the sentence in Figure
FIGREF9
as an example , the dependency of “ ARFTS ” is “ nsubj ” and the dependency of “ binds ” is “ ROOT ” . We encoded the dependency features using a one - hot scheme , and their dimensionality is 101 .

In this work , we consider the relationship of two protein mentions in a sentence . Thus , we used the position feature proposed in BIBREF37 , which consists of two relative distances , d1 and d2 , for representing the distances of the current word to PROT1 and PROT2 respectively . For example in Figure
FIGREF9
, the relative distances of the word “ binds ” to PROT1 ( “ ARFTS ” ) and PROT2 ( “ XIAP - BIR3 ” ) are 2 and -6 , respectively . As in Table S4 of
BIBREF35
, both d1 and d2 are non - linearly mapped to a ten - bit binary vector , where the first bit stands for the sign and the remaining bits for the distance .
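
A hedged Python sketch of how one word's embedding - layer features could be concatenated as described above : a 200 - d word2vec vector , an 8 - d POS - group code , an 18 - d one - hot chunk tag , a 4 - d entity type ( PROT1 / PROT2 / PROT / O ) , a 101 - d one - hot dependency label , and two 10 - bit position codes ( 351 dimensions in total ) . The distance binning below is illustrative , not the exact Table S4 mapping .

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size, dtype="float32")
    v[index] = 1.0
    return v

def position_code(distance, n_bits=10):
    """Sign bit followed by a coarse (assumed) non-linear bin of |distance|."""
    sign = 1.0 if distance < 0 else 0.0
    bin_id = min(int(np.log2(abs(distance) + 1)), n_bits - 2)
    return np.concatenate([[sign], one_hot(bin_id, n_bits - 1)])

def word_features(w2v_vec, pos_group, chunk_id, ent_id, dep_id, d1, d2):
    return np.concatenate([w2v_vec,               # 200-d word embedding
                           one_hot(pos_group, 8), # POS group
                           one_hot(chunk_id, 18), # chunk tag
                           one_hot(ent_id, 4),    # PROT1 / PROT2 / PROT / O
                           one_hot(dep_id, 101),  # incoming dependency label
                           position_code(d1),     # relative distance to PROT1
                           position_code(d2)])    # relative distance to PROT2
```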

Multichannel Embedding Layer

A novel aspect of McDepCNN is to add the “ head ” word representation of each word as the second channel of the embedding layer . For example , the second channel of the sentence in Figure
FIGREF9
is “ binds binds ROOT binds domain domain binds domain ” as shown in Figure
FIGREF2
. There are several advantages of using the “ head ” of a word as a separate channel .

First , it intuitively incorporates the dependency graph structure into the CNN model . Compared with BIBREF10 , which used the shortest path between the two entities as the sole input to the CNN , our model does not discard information outside the scope of the two entities . Such information has been reported to be useful
BIBREF38
. Compared with
BIBREF35
which used the shortest path as a sparse bag - of - words 0 - 1 vector , our model more directly reflects the syntactic structure of the dependencies of the input sentence .


Second , together with convolution , our model can capture dependencies over distances longer than the sliding window size . As shown in Figure
FIGREF9
, the second channel of McDepCNN breaks the dependency graph structure into structural < head word , child word > pairs where each word is a modifier of its previous word . In this way , it reflects the skeleton of a constituent where the second channel shadows the detailed information of all sub - constituents in the first channel . From the perspective of the sentence string , the second channel is similar to a gapped n -gram or a skipped n -gram where the skipped words are based on the structure of the sentence .

Convolution

We applied convolution to the input sentences to combine the two channels and extract local features BIBREF39 . Consider x 1 ,⋯,x n to be the sequence of word representations in a sentence , where DISPLAYFORM0

Here , the concatenation operation is used , so x i ∈ℝ d is the embedding vector for the i -th word , with dimensionality d . Let x i:i+k-1 c represent a window of size k in the sentence for channel c . Then the output sequence of the convolution layer is DISPLAYFORM0

where f is the rectified linear unit ( ReLU ) function and b k is the bias term . Both w k c and b k are learned parameters .

1-max pooling was then performed over each feature map , i.e. , the largest value in each feature map was recorded . In this way , we obtained fixed - length global features for the whole sentence . The underlying intuition is to consider only the most useful feature from the entire sentence . DISPLAYFORM0
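
A minimal Keras sketch of the per - channel convolution and 1 - max pooling described above : one filter bank per channel , feature maps summed across channels before the ReLU , then global max pooling . How exactly the two channels are combined is our assumption based on the form of the equation , not a detail stated in the text .

```python
import tensorflow as tf

def conv_block(seq_len=160, feat_dim=351, k=3, n_filters=400):
    ch1 = tf.keras.Input(shape=(seq_len, feat_dim))  # word channel
    ch2 = tf.keras.Input(shape=(seq_len, feat_dim))  # head-word channel
    maps = [tf.keras.layers.Conv1D(n_filters, k, use_bias=(c == 0))(ch)  # w_k^c per channel, one bias b_k
            for c, ch in enumerate((ch1, ch2))]
    summed = tf.keras.layers.Activation("relu")(tf.keras.layers.Add()(maps))
    pooled = tf.keras.layers.GlobalMaxPooling1D()(summed)  # 1-max pooling per feature map
    return tf.keras.Model([ch1, ch2], pooled)
```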

Fully Connected Layer with Softmax

To build a classifier over the extracted global features , we first applied a fully connected layer to the multichannel feature vectors obtained above . DISPLAYFORM0

The final softmax then receives this vector O as input and uses it to classify the PPI ; here we assume binary classification for the PPI task and hence depict two possible output states . DISPLAYFORM0

Here , θ is a vector of the parameters of the model , such as w k c , b k , w o , and b o . Further , we used the dropout technique on the output of the max pooling layer for regularization
BIBREF40
. This prevented our method from overfitting by randomly “ dropping ” neurons with probability ( 1-p ) during each forward / backward pass while training .

Training

To train the parameters , we optimized the log - likelihood with mini - batch training and a batch size of m . We used the Adam algorithm to optimize the loss function BIBREF41 . DISPLAYFORM0
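
Continuing the sketch above : dropout on the pooled multichannel features , a fully connected layer , a 2 - way softmax , and Adam with mini - batches of size m . The hidden size of the dense layer is an assumption ; the paper does not state it .

```python
import tensorflow as tf

def classifier_head(pooled_dim=400, drop_prob=0.5, hidden=256):
    feats = tf.keras.Input(shape=(pooled_dim,))
    x = tf.keras.layers.Dropout(drop_prob)(feats)               # dropout on the max-pooled output
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)     # fully connected layer (size assumed)
    probs = tf.keras.layers.Dense(2, activation="softmax")(x)   # PPI vs. no-PPI
    model = tf.keras.Model(feats, probs)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="sparse_categorical_crossentropy")
    return model

# training sketch: model.fit(X_pooled, y, batch_size=128, epochs=250)
```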

Experimental setup

For our experiments , we used the Genia Tagger to obtain the part - of - speech , chunk tags , and named entities of each word
BIBREF36
. We parsed each sentence using the Bllip parser with the biomedical model
BIBREF42
, BIBREF43 . The universal dependencies were then obtained by applying the Stanford dependencies converter on the parse tree with the CCProcessed and Universal options BIBREF44 .

We implemented the model using TensorFlow
BIBREF45
. All trainable variables were initialized using the Xavier algorithm BIBREF46 . We set the maximum sentence length to 160 ; longer sentences were pruned , and shorter sentences were padded with zeros . We set the learning rate to 0.0007 and the dropout probability to 0.5 . During training , we ran 250 epochs over all the training examples . For each epoch , we randomized the training examples and conducted mini - batch training with a batch size of 128 ( m=128 ) .

In this paper , we experimented with three window sizes : 3 , 5 and 7 , each with 400 filters . Every filter performs convolution on the sentence matrix and generates variable - length feature maps . We obtained the best results using a single window of size 3 ( see Section SECREF25 ) .
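
For reference , the hyperparameters reported above collected into a single configuration dict . This is a sketch ; the variable names are ours , not those of the released implementation .

```python
CONFIG = {
    "max_sentence_length": 160,   # longer sentences pruned, shorter ones zero-padded
    "learning_rate": 0.0007,
    "dropout_prob": 0.5,
    "epochs": 250,
    "batch_size": 128,
    "window_sizes": [3],          # 3, 5 and 7 were tried; a single window of 3 worked best
    "filters_per_window": 400,
    "weight_init": "xavier",
}
```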

Data

We evaluated McDepCNN on two benchmarking PPI corpora , AIMed
BIBREF12
and BioInfer
BIBREF13
. These two corpora have different sizes ( Table
TABREF23
) and vary slightly in their definition of PPI
BIBREF5
.

BIBREF6
conducted a comparison of a variety of PPI extraction systems on these two corpora . For comparability , we followed their experimental setup to evaluate our method : self - interactions were excluded from the corpora and 10-fold cross - validation ( CV ) was performed .
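
A hedged sketch of this evaluation protocol as we read it : self - interactions removed , then 10 - fold cross - validation . Here instances and labels are assumed to be NumPy arrays and build_model a factory returning a scikit - learn - style classifier ; any document - level grouping used in the original comparison is not reproduced .

```python
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def cross_validate(instances, labels, build_model):
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(instances):
        model = build_model()
        model.fit(instances[train_idx], labels[train_idx])
        scores.append(f1_score(labels[test_idx], model.predict(instances[test_idx])))
    return sum(scores) / len(scores)
```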

Results and discussion

Our system performance , as measured by Precision , Recall , and F1-score , is shown in Table
TABREF26
.
For comparison , we also include the results published in
BIBREF6
,
BIBREF47
,
BIBREF18
, BIBREF17 . Row 2 reports the results of the previous best deep learning system on these two corpora . Rows 3 and 4 report the results of two previous best single kernel - based methods , an APG kernel
BIBREF22
,
BIBREF6
and an edit kernel
BIBREF47
. Rows 5 - 6 report the results of two rule - based systems . As can be seen , McDepCNN achieved the highest results in both precision and overall F1-score on both datasets .

Note that we did not compare our results with two recent deep - learning approaches BIBREF10 ,
BIBREF9
. This is because , unlike other previous studies , they artificially removed sentences that cannot be parsed and discarded pairs that appear in coordinate structures . Thus , our results are not directly comparable with theirs . Neither did we compare our method with
BIBREF19
because they combined , in a rich vector , analysis from different parsers and the output of multiple kernels .

To further test the generalizability of our method , we conducted the cross - corpus experiments where we trained the model on one corpus and tested it on the other ( Table
TABREF27
) . Here we compared our results with the shallow linguistic model which is reported as the best kernel - based method in
BIBREF14
.

The cross - corpus results show that McDepCNN achieved a 24.4 % improvement in F - score when trained on BioInfer and tested on AIMed , and an 18.2 % improvement in the reverse direction .

To better understand the advantages of McDepCNN over kernel - based methods , we followed the lead of
BIBREF14
to compare performance on some known “ difficult ” instances in AIMed and BioInfer . This subset of difficult instances is defined as the 10 % of all pairs that the fewest of the 14 kernels were able to classify correctly ( Table
TABREF28
) .

Table
TABREF31
shows the comparisons between McDepCNN and the kernel - based methods on difficult instances . The results of McDepCNN were obtained on the difficult instances combined from AIMed and BioInfer ( 172 positives and 479 negatives ) , while the results of APG , Edit , and SL were obtained on AIMed , BioInfer , HPRD50 , IEPA , and LLL ( 190 positives and 521 negatives )
BIBREF14
. While the input datasets differ , our outcomes are remarkably higher than those of the prior studies : McDepCNN achieves an F1-score of 17.3 % on difficult instances , more than three times better than the other kernels . Since there are no examples of difficult instances that could not be classified correctly by at least one of the 14 kernel methods , below we only list some examples that McDepCNN classifies correctly .

Immunoprecipitation experiments further reveal that the fully assembled receptor complex is composed of two IL-6 PROT1 , two IL-6R alpha PROT2 , and two gp130 molecules .

The phagocyte NADPH oxidase is a complex of membrane cytochrome b558 ( comprised of subunits p22-phox and gp91-phox ) and three cytosol proteins ( p47-phox PROT1 , p67-phox , and p21rac ) that translocate to membrane and bind to cytochrome b558 PROT2 .

Together with the conclusions in
BIBREF14
, “ positive pairs are more difficult to classify in longer sentences ” and “ most of the analyzed classifiers fail to capture the characteristics of rare positive pairs in longer sentences ” , this comparison suggests that McDepCNN is probably capable of better capturing long - distance features from the sentence and is more generalizable than kernel methods .

Finally , Table
TABREF32
compares the effects of different parts of McDepCNN . Here we tested McDepCNN using 10-fold CV on AIMed . Row 1 used a single window of length 3 , row 2 used two windows , and row 3 used three windows . The reduced performance indicates that adding more windows did not improve the model . This is partially because the multichannel architecture in McDepCNN already captures good context features for PPI . Second , we used a single channel and retrained the model with window size 3 . The performance then dropped by 1.1 % . These results underscore the effectiveness of using the head word as a separate channel in the CNN .

Conclusion

In this paper , we described a multichannel dependency - based convolutional neural network for the sentence - based PPI task . Experiments on two benchmark corpora demonstrate that the proposed model outperforms the current deep learning model and single feature - based or kernel - based models . Further analysis suggests that our model is substantially more generalizable across different datasets . Utilizing the dependency structure of sentences as a separate channel also enables the model to capture global information more effectively .

In the future , we would like to investigate how to assemble different resources into our model , similar to what has been done to rich - feature - based methods
BIBREF19
where the current best performance was reported ( F - score of 64.0 % ( AIMed ) and 66.7 % ( BioInfer ) ) . We are also interested in extending the method to PPIs beyond the sentence boundary . Finally , we would like to test and generalize this approach to other biomedical relations such as chemical - disease relations
BIBREF48
.

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health , National Library of Medicine . We are also grateful to Robert Leaman for the helpful discussion .

interaction a task to identify interaction relations between protein entities mentioned within a document
hot zero in all positions except at this shortest - path 's index utilizes capture the dependencies of the sentence syntactic structure
domain knowledge via part - of tags
Word embedding a language modeling technique where words from the vocabulary are mapped to vectors of real numbers
BIBREF37 consists of two relative distances , INLINEFORM0
b k the biased term
shortest path a bag - of - word sparse 0 - 1 vector
model the dependencies of the input sentence
θ a vector of the hyper - parameters of the model , such as INLINEFORM1 , ,
system measured by Precision
oxidase a complex of
b558 cytochrome comprised of subunits p22-phox and gp91-phox ) and three cytosol proteins ( p47-phox INLINEFORM0 , p67-phox , and p21rac ) that translocate to membrane and bind to cytochrome b558
Second Language Acquisition Modeling: An Ensemble Approach 1806.04525 2018 W18-0525
ensemble approach
knowledge gaps
personalized learning systems
student knowledge gaps

Introduction

Understanding how students learn over time holds the key to unlock the full potential of adaptive learning . Indeed , personalizing the learning experience , so that educational content is recommended based on individual need in real time , promises to continuously stimulate motivation and the learning process
BIBREF0
. Accurate detection of students ' knowledge gaps is a fundamental building block of personalized learning systems
BIBREF1
BIBREF2
.
A number of approaches exist for modeling student knowledge and predicting student performance on future exercises , including IRT BIBREF3 , BKT
BIBREF4
and DKT
BIBREF5
. Here we propose an ensemble approach to predicting student knowledge gaps , which achieved the highest score on both evaluation metrics for all three datasets in the 2018 Shared Task on Second Language Acquisition Modeling ( SLAM )
BIBREF6
. We analyze in what cases our models ' predictions could be improved and discuss the relevance of the task setup for real - time delivery of personalized content within an educational setting .

Data and Evaluation Setup

The 2018 Shared Task on SLAM provides student trace data from users on the online educational platform Duolingo
BIBREF6
. Three different datasets are provided , representing users ’ responses to exercises completed over the first 30 days of learning English , French and Spanish as a second language . Common to all exercises is that the user responds with a sentence in the language being learnt . Importantly , the raw input sentence from the user is not available ; instead , only the best matching sentence among a set of correct answer sentences is given . The prediction task is to predict the word - level mistakes made by the user , given the best matching sentence and a number of additional features . The matching between the user response and the correct sentence was derived with the finite - state transducer method
BIBREF7
.

All datasets were pre - partitioned into training , development and test subsets , where approximately the last 10 % of the events for each user are used for testing and the last 10 % of the remaining events for development . Target labels for token - level mistakes are provided for the training and development sets but not for the test set . Aggregated metrics for the test set were obtained by submitting predictions to an evaluation server provided by Duolingo . Performance on this binary classification task is measured by area under the ROC curve ( AUC ) and F1-score .
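
A minimal sketch of the two evaluation metrics as we understand them : AUC computed on the raw probabilities and F1 on thresholded predictions . The 0.5 threshold is an assumption on our part ; the official scorer's choice is not described here .

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(y_true, y_prob, threshold=0.5):
    """AUC on raw probabilities, F1 on thresholded predictions (threshold assumed)."""
    y_prob = np.asarray(y_prob)
    return {"auc": roc_auc_score(y_true, y_prob),
            "f1": f1_score(y_true, (y_prob >= threshold).astype(int))}
```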

Although the dataset provided represents real user interactions on the Duolingo platform , the model evaluation setup does not represent a realistic scenario in which predictive modelling would be used to personalize the content presented to a user . The reason for this is threefold . Firstly , predictions are made given the best matching correct sentence , which , for questions with multiple correct answers , would not be known before the user answers . Secondly , a number of variables available at each point in time carry information from the future , creating a form of data leakage . Finally , the fact that interactions from each student span all data partitions means that the model can always be trained on the same users it is evaluated on ; there are never first - time users , for whom student mistakes would need to be inferred solely from sequential behaviour . To estimate prediction performance in an educational production setting , where next - step recommendations must be inferred from past observations , the evaluation procedure would have to be adjusted accordingly .

Method

To predict word - level mistakes we build an ensemble model which combines the predictions from a Gradient Boosted Decision Tree ( GBDT ) and a recurrent neural network model ( RNN ) . Our reasoning behind this approach lies in the observation that RNNs have been shown to achieve good results for sequential prediction tasks
BIBREF5
whereas GBDTs have consistently achieved state - of - the - art results on various benchmarks for tabular data BIBREF8 . Even though the data in this case is fundamentally sequential , the number of features and the fact that interactions for each user are available during training lead us to expect that both models will generate accurate predictions . Details of our model implementations are given below .

The Recurrent Neural Network

The recurrent neural network model that we use is a generalisation of the DKT model introduced by Piech et al. BIBREF5 , based on the popular LSTM architecture , with the following key modifications ( a rough sketch follows the list ) :

All available categorical and numerical features are fed as input to the network , at multiple input points in the network graph ( see SECREF30 )

The network operates on a word level , where words from different sentences are concatenated to form a single sequence

Information is propagated backward ( as well as forward ) in time , making it possible to predict the correctness of a word given all the surrounding words within the sentence

Multiple ordinary as well as recurrent layers are stacked , with the information from each level cascaded through skip - connections
BIBREF9
to form the final prediction
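
A rough Keras sketch ( our interpretation , not the authors ' code ) of the stacked bidirectional recurrent layers with skip connections feeding the final prediction . Feature embedding , the multiple input points , and the exact hidden sizes are omitted or assumed .

```python
import tensorflow as tf

def build_rnn(feat_dim, hidden=128, n_layers=3):
    inputs = tf.keras.Input(shape=(None, feat_dim))
    x, skips = inputs, []
    for _ in range(n_layers):
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden, return_sequences=True))(x)
        skips.append(x)                                   # skip connection from every level
    merged = tf.keras.layers.Concatenate()(skips + [inputs])
    # 3-way output per word: correct / mistake / unknown, matching the expanded target
    probs = tf.keras.layers.Dense(3, activation="softmax")(merged)
    return tf.keras.Model(inputs, probs)
```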

In model training , subsequences of up to 256 interactions are sampled from each user history in the training dataset , and only the second half of each subsequence is included in the loss function . The binary target variable representing word - level mistakes is expanded to a categorical variable and set to unknown for the second half of each subsequence , in order to match the evaluation setup .
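
A hedged sketch of this training - time sampling : draw a subsequence of at most 256 interactions per user and keep only its second half in the loss , so training mirrors the evaluation setup . The names history and MAX_LEN are ours .

```python
import random

MAX_LEN = 256

def sample_subsequence(history):
    """history: time-ordered list of (features, label) events for one user."""
    if len(history) > MAX_LEN:
        start = random.randint(0, len(history) - MAX_LEN)
        history = history[start:start + MAX_LEN]
    half = len(history) // 2
    loss_mask = [0.0] * half + [1.0] * (len(history) - half)  # only the second half enters the loss
    return history, loss_mask
```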

Log loss of predictions for each subsequence is minimised using adaptive moment estimation
BIBREF10
with a batch size of 32 . Regularisation with dropout
BIBREF11
and L2 regularisation
BIBREF12
is used for embeddings , recurrent and feed - forward layers . Data points are used once in each of 80 epochs , and performance is continuously evaluated on 70 % of the dev data after each epoch . The model with the highest performance over all epochs is then selected after training has finished . Finally , Gaussian Process Bandit Optimization BIBREF13 is used to tune the hyperparameters : learning rate , number of units in each layer , dropout probability and L2 coefficients .

The Gradient Boosted Decision Tree

The decision tree model is built using the LightGBM framework
BIBREF14
which implements a way of optimally partitioning categorical features , leaf - wise tree growth , as well as histogram binning for continuous variables
BIBREF15
.
In addition to the variables provided in the student trace data , we engineer a number of features that we anticipate to be relevant for predicting the word - level mistakes :

How many times the current token has been practiced

Time since token was last seen

Position index of token within the best matching sentence

The total number of tokens in

Position index of exercise within session

Preceding token

A unique identifier of the best matching sentence as a proxy for exercise id

Optimal hyperparameters are found through a grid search , training the model on the training set and evaluating on the development set to optimize AUC . The optimal GBDT parameter settings for each dataset can be found in the Supplementary Material SECREF42 .
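
A minimal LightGBM sketch of the GBDT component , assuming the engineered features above are columns of pandas DataFrames train_df / dev_df , with categorical columns already integer- or category - encoded . Column names and the parameter values shown are placeholders ; the tuned per - dataset settings are the ones in the supplementary material .

```python
import lightgbm as lgb

FEATURES = ["token_practice_count", "time_since_token_seen", "token_position",
            "sentence_length", "exercise_position", "preceding_token", "sentence_id"]
CATEGORICAL = ["preceding_token", "sentence_id"]

def train_gbdt(train_df, dev_df, label="mistake"):
    train_set = lgb.Dataset(train_df[FEATURES], train_df[label],
                            categorical_feature=CATEGORICAL)
    dev_set = lgb.Dataset(dev_df[FEATURES], dev_df[label], reference=train_set)
    params = {"objective": "binary", "metric": "auc",
              "learning_rate": 0.05, "num_leaves": 64}   # placeholder values, not the tuned settings
    return lgb.train(params, train_set, num_boost_round=1000, valid_sets=[dev_set])
```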

Ensemble Approach

The predictions generated by the recurrent neural network model and the GBDT model are combined through a weighted average . We train each model using its optimal hyperparameter setting on the train dataset and generate predictions on the dev set . The optimal ensemble weights are then found by varying the proportion of each model 's prediction and choosing the weight combination which yields the best AUC score ( Figure
FIGREF15
) .


Finally , the RNN and GBDT were trained using their respective optimal hyperparameter settings on the training and development datasets to generate predictions on the test sets . The individual model test set predictions were then combined using the optimal ensemble weights to generate the final test set predictions for task submission .
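
A hedged sketch of the weight search : sweep the RNN weight w over a grid , score w * p_rnn + ( 1 - w ) * p_gbdt against the dev labels by AUC , and keep the best w for combining the test - set predictions . The grid resolution is an assumption .

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def best_ensemble_weight(y_dev, p_rnn, p_gbdt, grid=np.linspace(0.0, 1.0, 101)):
    scores = [(roc_auc_score(y_dev, w * p_rnn + (1 - w) * p_gbdt), w) for w in grid]
    return max(scores)[1]   # RNN weight with the highest dev AUC

# test-time combination: p_test = w * p_rnn_test + (1 - w) * p_gbdt_test
```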

Discussion

Our ensemble approach yielded superior prediction performance on the test set compared to the individual performances of the ensemble components ( Table
TABREF16
) . The F1 scores of our ensemble are reported in Table
TABREF17
. We note that although the within - ensemble prediction correlations are high ( Table
TABREF18
) , the prediction diversity evidently suffices for the ensemble combination to outperform the underlying models . This suggests that the RNN and the GBDT differ in which word - level mistakes they predict well . Most likely , the temporal dynamics modelled by the neural network complement the GBDT predictions , enabling the ensemble to generalise better to unseen user events than its component parts . Notably , neither of our individual models would have yielded first place in the Shared Task .

Feature Importance

Given the predictive power of our model , we can use its components to gain insight into which features are most valuable when inferring student mistake patterns . When ranking GBDT features by information gain , we note that 4 out of 5 features overlap between the three datasets ( Figure