Participants may leave this workshop with skills to perform topic modeling, word embedding, and hidden-representation extraction on clinical text using R.
Before beginning, please test whether this Rmd file compiles on your system by clicking the "Knit HTML" button at the top of the RStudio editor.
Machine learning is a powerful tool for text mining, especially for uncovering hidden representations of text that are out of reach for regular expressions or simple NLP techniques alone.
In this workshop, we will use machine learning algorithms to perform topic modeling, word embedding, and extraction of a hidden representation of text with a deep autoencoder. We will use the iDASH dataset and the R language. First, we import the iDASH data and labels (please see the previous workshop for an introduction to the dataset).
# read the iDASH data
idash <- read.table("idash.txt", sep = "\t", header = FALSE, comment.char="", stringsAsFactors = FALSE)
# store texts to "data"
data <- idash$V1
# store labels to "label"
label <- as.factor(idash$V2)
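Before going further, it can help to take a quick look at what was loaded, for example how many documents carry each specialty label and what a single note looks like. This is just a minimal sanity check using base R:
# number of documents per specialty label
table(label)
# peek at the beginning of the first document
substr(data[1], 1, 200)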
Topic modeling is an unsupervised learning technique for summarizing and organizing large amounts of textual information. Topic modeling algorithms can identify words/terms that group into the same cluster (referred to as a "topic") across a collection of documents, discover hidden topics among the documents, and annotate/label the documents with the identified topics, thereby helping us understand, summarize, separate, and organize a large body of text. In other words, topic modeling can reveal a hidden thematic representation of the documents.
We choose Latent Dirichlet allocation (LDA), a common and popular probabilistic topic modeling method developed by Prof. David Blei, for topic modeling. LDA estimates both of the following at the same time: (1) each word in the document collection is attributed to one of the document's topics, and (2) each document (a set of words) is viewed as a mixture of the topics present in the document collection. For example:
Here, each clinical document contains multiple topics in different proportions. In this example, medical specialties can be regarded as the natural topics of the document collection. If we dive into the keywords of each topic, we may see something like:
Note that words can be shared across topics.
We use the LDA function in the topicmodels package to generate an LDA topic model. The LDA function takes a document-term matrix as input, which can be generated with the tm package introduced at the end of the previous workshop.
# as in the previous workshop, we use tm to build the document-term matrix
library(tm)
library(SnowballC)
corpus <- Corpus(VectorSource(data))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# add document names (here we simply use the row number)
rownames(dtm) <- 1:nrow(idash)
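(Optional) Before fitting a model, you may want to check the size of the document-term matrix and, if it is very large, drop extremely sparse terms. This is only a sketch with illustrative thresholds; the rest of the workshop keeps the full dtm, so the printed results below stay the same.
# dimensions of the document-term matrix (documents x terms)
dim(dtm)
# terms that occur at least 100 times across the corpus
findFreqTerms(dtm, 100)
# drop terms absent from more than 99.5% of documents (kept separate from dtm)
dtm_small <- removeSparseTerms(dtm, 0.995)
dim(dtm_small)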
In the LDA algorithm, we need to pass the number of possible natural topics (k) to the LDA function. For example, we set k = 6 to create a six-topic LDA model. You may use the topics function to get the topic of each document, and the terms function to see the keywords of each topic (here we show the top 30 keywords).
library(topicmodels)
# build LDA model
lda_model <- LDA(dtm, k = 6, control = list(seed = 777))
# get the assigned topic of each document, then show the first 50
lda_topics <- topics(lda_model, 1)
lda_topics[1:50]
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 6 2 5 5 5 5 6 6 6 6 6 6 6 6 6 4 5 5 6 6 6 6 6 6 6
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 6 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 2 5 5
# get keywords of each LDA topic
lda_keywords <- data.frame(terms(lda_model, 30), stringsAsFactors = FALSE)
lda_keywords
## Topic.1 Topic.2 Topic.3 Topic.4 Topic.5 Topic.6
## 1 patient patient patient left patient left
## 2 report histori procedur right histori arteri
## 3 time medic normal unremark normal right
## 4 histori day colon normal heart coronari
## 5 medic will right exam left cathet
## 6 test past place reveal pain patient
## 7 year diseas use bilater chest pressur
## 8 also daili left time diseas use
## 9 memori also biopsi note rate pulmonari
## 10 averag time well present arteri procedur
## 11 state pain scope brain coronari normal
## 12 depress blood tumor mri medic ventricular
## 13 function discharg polyp headach show branch
## 14 note deni remov day test french
## 15 perform hospit obtain weak breath perform
## 16 evalu present perform show daili descend
## 17 work current diagnosi seizur minut vein
## 18 difficulti review small year blood stenosi
## 19 attent chronic stomach patient per diseas
## 20 problem yearold oper throughout stress anterior
## 21 disord year upper gait pressur aortic
## 22 mother continu evid reflex atrial vessel
## 23 rang well lobe neurolog rhythm sheath
## 24 use state colonoscopi intact cardiac place
## 25 assess hypertens note difficulti ekg remov
## 26 age pressur without motor lung proxim
## 27 task renal anesthesia prior sinus valv
## 28 current recent appear region function stent
## 29 treatment normal taken mild short advanc
## 30 inform sinc esophagus sign myocardi circumflex
As you can see in lda_topics, most of the first 50 documents are assigned to topic 5 or 6. From lda_keywords, topic 3 looks like gastroenterology, topic 4 might be neurology, topic 1 is psychology, topics 5 and 6 are cardiology, and topic 2 is unclear.
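Because the iDASH documents also come with specialty labels, a quick way to check these guesses is to cross-tabulate the unsupervised topic assignments against the known labels. This is just a sanity check, and is only possible here because the corpus happens to be annotated:
# compare LDA topic assignments with the known specialty labels
table(lda_topics, label)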
(Optional) We can also take advantage of the tidytext package to see how words are associated with topics and how topics are associated with documents. The tidy function with the argument matrix = "beta" returns the per-topic-per-word probabilities (word-topic probabilities), called "beta", from the LDA model. The tidy function can also give you the per-document-per-topic probabilities (document-topic probabilities, called "gamma") if you use the argument matrix = "gamma". For example, in this sample LDA model we see that the term aaa belongs to topic 2 (from "beta"), and document 2 belongs to topic 2 (from "gamma", probability > 0.9996).
library(tidytext)
beta <- tidy(lda_model, matrix = "beta")
beta[1:20, ]
## # A tibble: 20 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 aaa 1.272744e-276
## 2 2 aaa 3.519668e-05
## 3 3 aaa 3.330001e-278
## 4 4 aaa 4.184585e-274
## 5 5 aaa 1.928054e-256
## 6 6 aaa 2.392378e-279
## 7 1 aaaa 5.041184e-05
## 8 2 aaaa 8.504099e-295
## 9 3 aaaa 2.317572e-298
## 10 4 aaaa 1.287829e-283
## 11 5 aaaa 2.595080e-292
## 12 6 aaaa 1.314820e-298
## 13 1 aampo 1.008237e-04
## 14 2 aampo 2.579104e-40
## 15 3 aampo 9.917867e-284
## 16 4 aampo 6.723912e-280
## 17 5 aampo 1.485989e-278
## 18 6 aampo 1.009843e-283
## 19 1 abandon 1.512347e-04
## 20 2 abandon 5.413660e-10
gamma <- tidy(lda_model, matrix = "gamma")
gamma[c(1:5, (431+1):(431+5), (431*2+1):(431*2+5), (431*3+1):(431*3+5)),]
## # A tibble: 20 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 1.813062e-04
## 2 2 1 7.982333e-05
## 3 3 1 1.299059e-04
## 4 4 1 1.140099e-04
## 5 5 1 1.614132e-04
## 6 1 2 1.813062e-04
## 7 2 2 9.996009e-01
## 8 3 2 1.299059e-04
## 9 4 2 1.140099e-04
## 10 5 2 1.614132e-04
## 11 1 3 1.813062e-04
## 12 2 3 7.982333e-05
## 13 3 3 1.299059e-04
## 14 4 3 1.140099e-04
## 15 5 3 1.614132e-04
## 16 1 4 1.813062e-04
## 17 2 4 7.982333e-05
## 18 3 4 1.299059e-04
## 19 4 4 1.140099e-04
## 20 5 4 1.614132e-04
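The beta table can also be summarized directly. For example, with the dplyr package you can extract the highest-probability terms in each topic, which is another route to something like lda_keywords. A small sketch, assuming dplyr is installed:
library(dplyr)
# top 10 terms per topic, ranked by word-topic probability
beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)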
LDA-derived topics/clusters can serve as natural labels for supervised classification, as we will do in the exercise below. As a point of reference, here we first train a simple decision tree on the document-term matrix using the original iDASH specialty labels. The accuracy is 0.7874, which is not bad.
d <- data.frame(as.matrix(dtm))
d$label <- label
# build a decision tree
library(caret)
library(rpart)
# split the dataset based on the label (70% training, 30% testing)
set.seed(123)
inTraining <- createDataPartition(d$label, p=0.7, list=F)
training <- d[ inTraining, ]
testing <- d[-inTraining, ]
# build the decision tree
model <- rpart(label~., data=training)
#summary(model)
# plot the decision tree
plot(model, uniform=TRUE)
text(model, use.n=TRUE)
tst <- testing
tst$label = NULL
# evaluate the performance
pred <- predict(model, tst, type="class")
confusionMatrix(testing$label, pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cv gi neuro psych pulmo renal
## cv 30 1 2 0 0 1
## gi 1 25 5 0 0 2
## neuro 3 1 25 0 0 0
## psych 0 0 3 6 0 0
## pulmo 4 2 1 0 9 0
## renal 0 0 0 0 1 5
##
## Overall Statistics
##
## Accuracy : 0.7874
## 95% CI : (0.706, 0.855)
## No Information Rate : 0.2992
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7273
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: cv Class: gi Class: neuro Class: psych
## Sensitivity 0.7895 0.8621 0.6944 1.00000
## Specificity 0.9551 0.9184 0.9560 0.97521
## Pos Pred Value 0.8824 0.7576 0.8621 0.66667
## Neg Pred Value 0.9140 0.9574 0.8878 1.00000
## Prevalence 0.2992 0.2283 0.2835 0.04724
## Detection Rate 0.2362 0.1969 0.1969 0.04724
## Detection Prevalence 0.2677 0.2598 0.2283 0.07087
## Balanced Accuracy 0.8723 0.8902 0.8252 0.98760
## Class: pulmo Class: renal
## Sensitivity 0.90000 0.62500
## Specificity 0.94017 0.99160
## Pos Pred Value 0.56250 0.83333
## Neg Pred Value 0.99099 0.97521
## Prevalence 0.07874 0.06299
## Detection Rate 0.07087 0.03937
## Detection Prevalence 0.12598 0.04724
## Balanced Accuracy 0.92009 0.80830
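The single train/test split above gives only one estimate of performance. If you want a slightly more robust number, caret's train function can fit the same rpart model with cross-validation. A sketch with arbitrary but reasonable settings (5 folds), not tuned in any way:
# 5-fold cross-validated decision tree on the training set
ctrl <- trainControl(method = "cv", number = 5)
cv_model <- train(label ~ ., data = training, method = "rpart", trControl = ctrl)
cv_model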
Again, we use the PhysioNet Deidentified Medical Text dataset for the exercise.
x <- readChar("id.txt", file.info("id.txt")$size)
x <- gsub("\n\n\\|\\|\\|\\|END_OF_RECORD\n\nSTART_OF_RECORD=[0-9]+\\|\\|\\|\\|[0-9]+\\|\\|\\|\\|\n", " [split] ", x)
x <- gsub("\n\n\\|\\|\\|\\|END_OF_RECORD\n\n", "", x)
x <- strsplit(x, " \\[split\\] ")
x <- x[[1]]
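A quick check that the records were split as expected (x should now contain one character string per record):
# number of records after splitting
length(x)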
Try to use LDA to find the topic words (assuming that there are 5 topics), and find the topic of each document.
library(tm)
library(SnowballC)
library(topicmodels)
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
lda_model <- LDA(dtm, k = 5, control = list(seed = 777))
lda_keywords <- data.frame(terms(lda_model, 10), stringsAsFactors = FALSE)
lda_keywords
## Topic.1 Topic.2 Topic.3 Topic.4 Topic.5
## 1 will neuro given neuro remain
## 2 resp time vent remain resp
## 3 continu wean gtt follow sat
## 4 remain sedat resp today stabl
## 5 cont cont command thick urin
## 6 med chang cont increas note
## 7 note secret sound abg continu
## 8 clear sat cath place today
## 9 stool increas wean lasix cont
## 10 plan vent famili sat gtt
# use the result of LDA for supervised learning
x_df <- as.data.frame(x)
x_df$label <- topics(lda_model, 1)
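From here, the LDA-derived topics can play the role of class labels, just as in the decision tree example above. A minimal sketch reusing the exercise's document-term matrix; keep in mind that with unsupervised labels the resulting "accuracy" mostly reflects how separable the topics are:
library(rpart)
# use the LDA topic assignments as class labels
d2 <- data.frame(as.matrix(dtm))
d2$label <- as.factor(topics(lda_model, 1))
tree_model <- rpart(label ~ ., data = d2)
# plot(tree_model, uniform = TRUE); text(tree_model, use.n = TRUE)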
Word embedding is a natural language modeling method for representing words as vectors. Simply put, it is a technique that maps each word to a vector. Instead of a very sparse one-hot representation or a frequency-count (bag-of-words) representation, word embedding encodes each word of a large corpus into a dense vector that captures semantic and linguistic regularities that bag-of-words cannot. One of the most famous examples of word embedding is the following operation:
\[ vector('paris') - vector('france') + vector('germany') = vector('berlin') \]
Please check the Wikipedia introduction for more background on word embedding. The most popular word embedding algorithms, word2vec and GloVe (Global Vectors for Word Representation), were developed by Mikolov at Google and Pennington at Stanford, respectively.
In Python, you can use gensim or tensorflow for a word2vec implementation. Here, we introduce the text2vec package, an R package with a C++ backend that implements GloVe. text2vec will help you process the raw data through the following steps.
First, create a vocabulary set of the words for which we want to learn word vectors. Infrequent words can be removed with the term_count_min argument of the prune_vocabulary function (here we keep words that appear at least 5 times).
library(text2vec)
tokens <- space_tokenizer(data)
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5)
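You can peek at what remains after pruning. The exact structure of the vocabulary object depends on your text2vec version, so a generic str call is the safest way to look:
# overview of the pruned vocabulary object
str(vocab, max.level = 1)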
Next, we construct a sparse term co-occurrence matrix (TCM). The size of the context window is controlled by the skip_grams_window argument; skip_grams_window = 5 can be imagined as: the central word can be inferred from the 5 words to its left and the 5 words to its right.
vectorizer <- vocab_vectorizer(vocab,
                               grow_dtm = FALSE,
                               skip_grams_window = 5)
tcm <- create_tcm(it, vectorizer)
Finally, we factorize the TCM with GloVe's parallel stochastic gradient descent algorithm and fit the GloVe model. word_vectors_size is an important argument: it determines the dimensionality of the output word vectors; here we use a size of 200.
# replace CORE_NUMBER with the number of CPU cores you want to use
RcppParallel::setThreadOptions(numThreads = CORE_NUMBER)
word_embedding_size <- 200
glove = GlobalVectors$new(word_vectors_size = word_embedding_size, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
## 2017-06-27 02:30:08 - epoch 1, expected cost 0.1715
## 2017-06-27 02:30:09 - epoch 2, expected cost 0.0874
## 2017-06-27 02:30:09 - epoch 3, expected cost 0.0591
## 2017-06-27 02:30:10 - epoch 4, expected cost 0.0312
## 2017-06-27 02:30:10 - epoch 5, expected cost 0.0229
## 2017-06-27 02:30:11 - epoch 6, expected cost 0.0179
## 2017-06-27 02:30:12 - epoch 7, expected cost 0.0147
## 2017-06-27 02:30:12 - epoch 8, expected cost 0.0123
## 2017-06-27 02:30:13 - epoch 9, expected cost 0.0105
## 2017-06-27 02:30:13 - epoch 10, expected cost 0.0090
## 2017-06-27 02:30:14 - epoch 11, expected cost 0.0079
## 2017-06-27 02:30:14 - epoch 12, expected cost 0.0069
## 2017-06-27 02:30:15 - epoch 13, expected cost 0.0061
## 2017-06-27 02:30:15 - epoch 14, expected cost 0.0054
## 2017-06-27 02:30:16 - epoch 15, expected cost 0.0049
## 2017-06-27 02:30:16 - epoch 16, expected cost 0.0044
## 2017-06-27 02:30:17 - epoch 17, expected cost 0.0040
## 2017-06-27 02:30:17 - epoch 18, expected cost 0.0036
## 2017-06-27 02:30:17 - epoch 19, expected cost 0.0033
## 2017-06-27 02:30:18 - epoch 20, expected cost 0.0030
word_vectors <- glove$get_word_vectors()
Once we have the word vectors, we can try something similar to the "paris - france + germany" operation by calculating the cosine similarity between word vectors:
unknown <- word_vectors["mitral", , drop = FALSE] -
word_vectors["left", , drop = FALSE] +
word_vectors["right", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = unknown, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
## mitral valve tricuspid regurgitation right
## 0.7041392 0.3972850 0.3098233 0.2951544 0.2411986
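The same sim2 call can list the nearest neighbors of a single word, which is a quick way to eyeball whether the embedding has picked up clinical regularities. A small sketch; "valve" is just an example term that we know is in the vocabulary because it appears in the result above:
# ten words closest to "valve" in the embedding space
target <- word_vectors["valve", , drop = FALSE]
cos_sim_valve <- sim2(x = word_vectors, y = target, method = "cosine", norm = "l2")
head(sort(cos_sim_valve[, 1], decreasing = TRUE), 10)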
Build a GloVe word vector model for the PhysioNet Deidentified Medical Text (saved in the variable x). Try tuning skip_grams_window, word_vectors_size, and n_iter to optimize your result!
library(text2vec)
tokens <- space_tokenizer(x)
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab,
grow_dtm = FALSE,
skip_grams_window = 5)
tcm <- create_tcm(it, vectorizer)
word_embedding_size <- 200
glove = GlobalVectors$new(word_vectors_size = word_embedding_size, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
## 2017-06-27 02:30:22 - epoch 1, expected cost 0.1460
## 2017-06-27 02:30:23 - epoch 2, expected cost 0.0655
## 2017-06-27 02:30:24 - epoch 3, expected cost 0.0402
## 2017-06-27 02:30:25 - epoch 4, expected cost 0.0244
## 2017-06-27 02:30:26 - epoch 5, expected cost 0.0191
## 2017-06-27 02:30:28 - epoch 6, expected cost 0.0156
## 2017-06-27 02:30:30 - epoch 7, expected cost 0.0132
## 2017-06-27 02:30:31 - epoch 8, expected cost 0.0113
## 2017-06-27 02:30:32 - epoch 9, expected cost 0.0099
## 2017-06-27 02:30:33 - epoch 10, expected cost 0.0087
## 2017-06-27 02:30:34 - epoch 11, expected cost 0.0078
## 2017-06-27 02:30:35 - epoch 12, expected cost 0.0070
## 2017-06-27 02:30:36 - epoch 13, expected cost 0.0063
## 2017-06-27 02:30:37 - epoch 14, expected cost 0.0058
## 2017-06-27 02:30:38 - epoch 15, expected cost 0.0053
## 2017-06-27 02:30:39 - epoch 16, expected cost 0.0048
## 2017-06-27 02:30:40 - epoch 17, expected cost 0.0045
## 2017-06-27 02:30:41 - epoch 18, expected cost 0.0041
## 2017-06-27 02:30:42 - epoch 19, expected cost 0.0038
## 2017-06-27 02:30:43 - epoch 20, expected cost 0.0036
word_vectors <- glove$get_word_vectors()
unknown <- word_vectors["heart", , drop = FALSE] -
word_vectors["left", , drop = FALSE] +
word_vectors["right", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = unknown, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)
## heart right rate fem transplant RISS.
## 0.7030619 0.3826807 0.3258327 0.3023557 0.2654757 0.2418842
## 700x10 tolerate 48 130'S
## 0.2357185 0.2250517 0.2236517 0.2235856
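As the exercise suggests, you can re-fit the model with different hyperparameters and compare the neighbors you get. The values below (wider window, smaller vectors, more iterations) are just one setting to try, not a recommendation:
# rebuild the vectorizer/TCM with a wider context window and re-fit GloVe
vectorizer2 <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 10)
tcm2 <- create_tcm(it, vectorizer2)
glove2 <- GlobalVectors$new(word_vectors_size = 100, vocabulary = vocab, x_max = 10)
glove2$fit(tcm2, n_iter = 30)
word_vectors2 <- glove2$get_word_vectors()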
Thank you for checking out the second workshop. We hope that you are now more familiar with language modeling and deep learning using R.