Participants may leave this workshop with skills to perform topic modeling, word embedding, and hidden-representation extraction on clinical text using R.
Before beginning, please test whether this Rmd file compiles on your system by clicking the "Knit HTML" button at the top of the RStudio editor.
Machine learning is a powerful tool for text mining, especially for uncovering hidden representations of text that are out of reach for regular expressions or simple NLP techniques alone.
In this workshop, we will use machine learning algorithms to perform topic modeling, word embedding, and extraction of a hidden representation of text with a deep autoencoder. We will use the iDASH dataset and the R language. First, we import the iDASH data and labels (please see the previous workshop for an introduction to the dataset).
# read the iDASH data
idash <- read.table("idash.txt", sep = "\t", header = FALSE, comment.char="", stringsAsFactors = FALSE)
# store texts to "data"
data <- idash$V1
# store labels to "label"
label <- as.factor(idash$V2)
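Before going further, it can help to take a quick look at what was loaded, for example how many documents carry each specialty label and what a single note looks like. This is just a minimal sanity check using base R:
# number of documents per specialty label
table(label)
# peek at the beginning of the first document
substr(data[1], 1, 200)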
Topic modeling is an unsupervised learning technique for summarizing and organizing large amounts of textual information. Topic modeling algorithms can identify words/terms that group into the same cluster (referred to as a "topic") across a collection of documents, discover hidden topics among the documents, and annotate/label the documents with the identified topics, thereby helping us understand, summarize, separate, and organize a large body of text. In other words, topic modeling can reveal a hidden thematic representation of the documents.
We choose Latent Dirichlet allocation (LDA), a common and popular probabilistic topic modeling method developed by Prof. David Blei, for topic modeling. LDA estimates both of the following at the same time: (1) each word in the document collection is attributed to one of the document's topics, and (2) each document (a set of words) is viewed as a mixture of the topics present in the document collection. For example:
Here, each clinical document contains multiple topics in different proportions. In this example, medical specialties can be regarded as the natural topics of the document collection. If we dive into the keywords of each topic, we may see something like:
Note that words can be shared across topics.
We use the LDA function in the topicmodels package to generate an LDA topic model. The LDA function takes a document-term matrix as input, which can be generated with the tm package introduced at the end of the previous workshop.
# as in the previous workshop, we use tm to build the document-term matrix
library(tm)
library(SnowballC)
corpus <- Corpus(VectorSource(data))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# add document names (here we simply use the row number)
rownames(dtm) <- 1:nrow(idash)
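(Optional) Before fitting a model, you may want to check the size of the document-term matrix and, if it is very large, drop extremely sparse terms. This is only a sketch with illustrative thresholds; the rest of the workshop keeps the full dtm, so the printed results below stay the same.
# dimensions of the document-term matrix (documents x terms)
dim(dtm)
# terms that occur at least 100 times across the corpus
findFreqTerms(dtm, 100)
# drop terms absent from more than 99.5% of documents (kept separate from dtm)
dtm_small <- removeSparseTerms(dtm, 0.995)
dim(dtm_small)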
In the LDA algorithm, we need to pass the number of possible natural topics (k) to the LDA function. For example, we set k = 6 to create a six-topic LDA model. You may use the topics function to get the topic of each document, and the terms function to see the keywords of each topic (here we show the top 30 keywords).
library(topicmodels)
# build LDA model
lda_model <- LDA(dtm, k = 6, control = list(seed = 777))
# get the assigned topic of each document, then show the first 50
lda_topics <- topics(lda_model, 1)
lda_topics[1:50]
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 6 2 5 5 5 5 6 6 6 6 6 6 6 6 6 4 5 5 6 6 6 6 6 6 6
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 6 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5 2 5 5
# get keywords of each LDA topic
lda_keywords <- data.frame(terms(lda_model, 30), stringsAsFactors = FALSE)
lda_keywords
## Topic.1 Topic.2 Topic.3 Topic.4 Topic.5 Topic.6
## 1 patient patient patient left patient left
## 2 report histori procedur right histori arteri
## 3 time medic normal unremark normal right
## 4 histori day colon normal heart coronari
## 5 medic will right exam left cathet
## 6 test past place reveal pain patient
## 7 year diseas use bilater chest pressur
## 8 also daili left time diseas use
## 9 memori also biopsi note rate pulmonari
## 10 averag time well present arteri procedur
## 11 state pain scope brain coronari normal
## 12 depress blood tumor mri medic ventricular
## 13 function discharg polyp headach show branch
## 14 note deni remov day test french
## 15 perform hospit obtain weak breath perform
## 16 evalu present perform show daili descend
## 17 work current diagnosi seizur minut vein
## 18 difficulti review small year blood stenosi
## 19 attent chronic stomach patient per diseas
## 20 problem yearold oper throughout stress anterior
## 21 disord year upper gait pressur aortic
## 22 mother continu evid reflex atrial vessel
## 23 rang well lobe neurolog rhythm sheath
## 24 use state colonoscopi intact cardiac place
## 25 assess hypertens note difficulti ekg remov
## 26 age pressur without motor lung proxim
## 27 task renal anesthesia prior sinus valv
## 28 current recent appear region function stent
## 29 treatment normal taken mild short advanc
## 30 inform sinc esophagus sign myocardi circumflex
As you can see in lda_topics, most of the first 50 documents are assigned to topic 5 or 6. From lda_keywords, topic 3 looks like gastroenterology, topic 4 might be neurology, topic 1 is psychology, topics 5 and 6 are cardiology, and topic 2 is unclear.
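Because the iDASH documents also come with specialty labels, a quick way to check these guesses is to cross-tabulate the unsupervised topic assignments against the known labels. This is just a sanity check, and is only possible here because the corpus happens to be annotated:
# compare LDA topic assignments with the known specialty labels
table(lda_topics, label)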
(Optional) We can also take advantage of the tidytext package to see how words are associated with topics and how topics are associated with documents. The tidy function with the argument matrix = "beta" returns the per-topic-per-word probabilities (word-topic probabilities), called "beta", from the LDA model. The tidy function can also give you the per-document-per-topic probabilities (document-topic probabilities, called "gamma") if you use the argument matrix = "gamma". For example, in this sample LDA model we see that the term aaa belongs to topic 2 (from "beta"), and document 2 belongs to topic 2 (from "gamma", probability > 0.9996).
library(tidytext)
beta <- tidy(lda_model, matrix = "beta")
beta[1:20, ]
## # A tibble: 20 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 aaa 1.272744e-276
## 2 2 aaa 3.519668e-05
## 3 3 aaa 3.330001e-278
## 4 4 aaa 4.184585e-274
## 5 5 aaa 1.928054e-256
## 6 6 aaa 2.392378e-279
## 7 1 aaaa 5.041184e-05
## 8 2 aaaa 8.504099e-295
## 9 3 aaaa 2.317572e-298
## 10 4 aaaa 1.287829e-283
## 11 5 aaaa 2.595080e-292
## 12 6 aaaa 1.314820e-298
## 13 1 aampo 1.008237e-04
## 14 2 aampo 2.579104e-40
## 15 3 aampo 9.917867e-284
## 16 4 aampo 6.723912e-280
## 17 5 aampo 1.485989e-278
## 18 6 aampo 1.009843e-283
## 19 1 abandon 1.512347e-04
## 20 2 abandon 5.413660e-10
gamma <- tidy(lda_model, matrix = "gamma")
gamma[c(1:5, (431+1):(431+5), (431*2+1):(431*2+5), (431*3+1):(431*3+5)),]
## # A tibble: 20 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 1.813062e-04
## 2 2 1 7.982333e-05
## 3 3 1 1.299059e-04
## 4 4 1 1.140099e-04
## 5 5 1 1.614132e-04
## 6 1 2 1.813062e-04
## 7 2 2 9.996009e-01
## 8 3 2 1.299059e-04
## 9 4 2 1.140099e-04
## 10 5 2 1.614132e-04
## 11 1 3 1.813062e-04
## 12 2 3 7.982333e-05
## 13 3 3 1.299059e-04
## 14 4 3 1.140099e-04
## 15 5 3 1.614132e-04
## 16 1 4 1.813062e-04
## 17 2 4 7.982333e-05
## 18 3 4 1.299059e-04
## 19 4 4 1.140099e-04
## 20 5 4 1.614132e-04
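The beta table can also be summarized directly. For example, with the dplyr package you can extract the highest-probability terms in each topic, which is another route to something like lda_keywords. A small sketch, assuming dplyr is installed:
library(dplyr)
# top 10 terms per topic, ranked by word-topic probability
beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)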
LDA-derived topics/clusters can serve as natural labels for supervised classification, as we will do in the exercise below. As a point of reference, here we first train a simple decision tree on the document-term matrix using the original iDASH specialty labels. The accuracy is 0.7874, which is not bad.
d <- data.frame(as.matrix(dtm))
d$label <- label
# build a decision tree
library(caret)
library(rpart)
# split the dataset based on the label (70% training, 30% testing)
set.seed(123)
inTraining <- createDataPartition(d$label, p=0.7, list=F)
training <- d[ inTraining, ]
testing <- d[-inTraining, ]
# build the decision tree
model <- rpart(label~., data=training)
#summary(model)
# plot the decision tree
plot(model, uniform=TRUE)
text(model, use.n=TRUE)
tst <- testing
tst$label = NULL
# evaluate the performance
pred <- predict(model, tst, type="class")
confusionMatrix(testing$label, pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction cv gi neuro psych pulmo renal
## cv 30 1 2 0 0 1
## gi 1 25 5 0 0 2
## neuro 3 1 25 0 0 0
## psych 0 0 3 6 0 0
## pulmo 4 2 1 0 9 0
## renal 0 0 0 0 1 5
##
## Overall Statistics
##
## Accuracy : 0.7874
## 95% CI : (0.706, 0.855)
## No Information Rate : 0.2992
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7273
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: cv Class: gi Class: neuro Class: psych
## Sensitivity 0.7895 0.8621 0.6944 1.00000
## Specificity 0.9551 0.9184 0.9560 0.97521
## Pos Pred Value 0.8824 0.7576 0.8621 0.66667
## Neg Pred Value 0.9140 0.9574 0.8878 1.00000
## Prevalence 0.2992 0.2283 0.2835 0.04724
## Detection Rate 0.2362 0.1969 0.1969 0.04724
## Detection Prevalence 0.2677 0.2598 0.2283 0.07087
## Balanced Accuracy 0.8723 0.8902 0.8252 0.98760
## Class: pulmo Class: renal
## Sensitivity 0.90000 0.62500
## Specificity 0.94017 0.99160
## Pos Pred Value 0.56250 0.83333
## Neg Pred Value 0.99099 0.97521
## Prevalence 0.07874 0.06299
## Detection Rate 0.07087 0.03937
## Detection Prevalence 0.12598 0.04724
## Balanced Accuracy 0.92009 0.80830
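The single train/test split above gives only one estimate of performance. If you want a slightly more robust number, caret's train function can fit the same rpart model with cross-validation. A sketch with arbitrary but reasonable settings (5 folds), not tuned in any way:
# 5-fold cross-validated decision tree on the training set
ctrl <- trainControl(method = "cv", number = 5)
cv_model <- train(label ~ ., data = training, method = "rpart", trControl = ctrl)
cv_model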
Again, we use the PhysioNet Deidentified Medical Text dataset for the exercise.
x <- readChar("id.txt", file.info("id.txt")$size)
x <- gsub("\n\n\\|\\|\\|\\|END_OF_RECORD\n\nSTART_OF_RECORD=[0-9]+\\|\\|\\|\\|[0-9]+\\|\\|\\|\\|\n", " [split] ", x)
x <- gsub("\n\n\\|\\|\\|\\|END_OF_RECORD\n\n", "", x)
x <- strsplit(x, " \\[split\\] ")
x <- x[[1]]
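A quick check that the records were split as expected (x should now contain one character string per record):
# number of records after splitting
length(x)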
Try to use LDA to find the topic words (assuming that there are 5 topics), and find the topic of each document.
library(tm)
library(SnowballC)
library(topicmodels)
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
lda_model <- LDA(dtm, k = 5, control = list(seed = 777))
lda_keywords <- data.frame(terms(lda_model, 10), stringsAsFactors = FALSE)
lda_keywords
## Topic.1 Topic.2 Topic.3 Topic.4 Topic.5
## 1 will neuro given neuro remain
## 2 resp time vent remain resp
## 3 continu wean gtt follow sat
## 4 remain sedat resp today stabl
## 5 cont cont command thick urin
## 6 med chang cont increas note
## 7 note secret sound abg continu
## 8 clear sat cath place today
## 9 stool increas wean lasix cont
## 10 plan vent famili sat gtt
# use the result of LDA for supervised learning
x_df <- as.data.frame(x)
x_df$label <- topics(lda_model, 1)
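From here, the LDA-derived topics can play the role of class labels, just as in the decision tree example above. A minimal sketch reusing the exercise's document-term matrix; keep in mind that with unsupervised labels the resulting "accuracy" mostly reflects how separable the topics are:
library(rpart)
# use the LDA topic assignments as class labels
d2 <- data.frame(as.matrix(dtm))
d2$label <- as.factor(topics(lda_model, 1))
tree_model <- rpart(label ~ ., data = d2)
# plot(tree_model, uniform = TRUE); text(tree_model, use.n = TRUE)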
Word embedding is a natural language modeling method for representing words as vectors. Simply put, it is a technique that maps each word to a vector. Instead of a very sparse one-hot representation or a frequency-count (bag-of-words) representation, word embedding encodes each word of a large corpus into a dense vector that captures semantic and linguistic regularities that bag-of-words cannot. One of the most famous examples of word embedding is the following operation:
\[ vector('paris') - vector('france') + vector('germany') = vector('berlin') \]
Please check the Wikipedia introduction for more background on word embedding. The most popular word embedding algorithms, word2vec and GloVe (Global Vectors for Word Representation), were developed by Mikolov at Google and Pennington at Stanford, respectively.
In Python, you can use gensim or tensorflow for a word2vec implementation. Here, we introduce the text2vec package, an R package with a C++ backend that implements GloVe. text2vec will help you process the raw data through the following steps.
First, create a vocabulary set of the words for which we want to learn word vectors. Infrequent words can be removed with the term_count_min argument of the prune_vocabulary function (here we keep words that appear at least 5 times).
library(text2vec)
tokens <- space_tokenizer(data)
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5)
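You can peek at what remains after pruning. The exact structure of the vocabulary object depends on your text2vec version, so a generic str call is the safest way to look:
# overview of the pruned vocabulary object
str(vocab, max.level = 1)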
Next, we construct a sparse term co-occurrence matrix (TCM). The size of the context window is controlled by the skip_grams_window argument; skip_grams_window = 5 can be imagined as: the central word can be inferred from the 5 words to its left and the 5 words to its right.
vectorizer <- vocab_vectorizer(vocab,
                               grow_dtm = FALSE,
                               skip_grams_window = 5)
tcm <- create_tcm(it, vectorizer)
Finally, we factorize the TCM with GloVe's parallel stochastic gradient descent algorithm and fit the GloVe model. word_vectors_size is an important argument: it determines the dimensionality of the output word vectors; here we use a size of 200.
# replace CORE_NUMBER with the number of CPU cores you want to use
RcppParallel::setThreadOptions(numThreads = CORE_NUMBER)
word_embedding_size <- 200
glove = GlobalVectors$new(word_vectors_size = word_embedding_size, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
## 2017-06-27 02:30:08 - epoch 1, expected cost 0.1715
## 2017-06-27 02:30:09 - epoch 2, expected cost 0.0874
## 2017-06-27 02:30:09 - epoch 3, expected cost 0.0591
## 2017-06-27 02:30:10 - epoch 4, expected cost 0.0312
## 2017-06-27 02:30:10 - epoch 5, expected cost 0.0229
## 2017-06-27 02:30:11 - epoch 6, expected cost 0.0179
## 2017-06-27 02:30:12 - epoch 7, expected cost 0.0147
## 2017-06-27 02:30:12 - epoch 8, expected cost 0.0123
## 2017-06-27 02:30:13 - epoch 9, expected cost 0.0105
## 2017-06-27 02:30:13 - epoch 10, expected cost 0.0090
## 2017-06-27 02:30:14 - epoch 11, expected cost 0.0079
## 2017-06-27 02:30:14 - epoch 12, expected cost 0.0069
## 2017-06-27 02:30:15 - epoch 13, expected cost 0.0061
## 2017-06-27 02:30:15 - epoch 14, expected cost 0.0054
## 2017-06-27 02:30:16 - epoch 15, expected cost 0.0049
## 2017-06-27 02:30:16 - epoch 16, expected cost 0.0044
## 2017-06-27 02:30:17 - epoch 17, expected cost 0.0040
## 2017-06-27 02:30:17 - epoch 18, expected cost 0.0036
## 2017-06-27 02:30:17 - epoch 19, expected cost 0.0033
## 2017-06-27 02:30:18 - epoch 20, expected cost 0.0030
word_vectors <- glove$get_word_vectors()
Once we have the word vectors, we can try something similar to the "paris - france + germany" operation by calculating the cosine similarity between word vectors:
unknown <- word_vectors["mitral", , drop = FALSE] -
word_vectors["left", , drop = FALSE] +
word_vectors["right", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = unknown, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
## mitral valve tricuspid regurgitation right
## 0.7041392 0.3972850 0.3098233 0.2951544 0.2411986
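The same sim2 call can list the nearest neighbors of a single word, which is a quick way to eyeball whether the embedding has picked up clinical regularities. A small sketch; "valve" is just an example term that we know is in the vocabulary because it appears in the result above:
# ten words closest to "valve" in the embedding space
target <- word_vectors["valve", , drop = FALSE]
cos_sim_valve <- sim2(x = word_vectors, y = target, method = "cosine", norm = "l2")
head(sort(cos_sim_valve[, 1], decreasing = TRUE), 10)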
Build a GloVe word vector model for the PhysioNet Deidentified Medical Text (saved in the variable x). Try tuning skip_grams_window, word_vectors_size, and n_iter to optimize your result!
library(text2vec)
tokens <- space_tokenizer(x)
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab,
grow_dtm = FALSE,
skip_grams_window = 5)
tcm <- create_tcm(it, vectorizer)
word_embedding_size <- 200
glove = GlobalVectors$new(word_vectors_size = word_embedding_size, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)
## 2017-06-27 02:30:22 - epoch 1, expected cost 0.1460
## 2017-06-27 02:30:23 - epoch 2, expected cost 0.0655
## 2017-06-27 02:30:24 - epoch 3, expected cost 0.0402
## 2017-06-27 02:30:25 - epoch 4, expected cost 0.0244
## 2017-06-27 02:30:26 - epoch 5, expected cost 0.0191
## 2017-06-27 02:30:28 - epoch 6, expected cost 0.0156
## 2017-06-27 02:30:30 - epoch 7, expected cost 0.0132
## 2017-06-27 02:30:31 - epoch 8, expected cost 0.0113
## 2017-06-27 02:30:32 - epoch 9, expected cost 0.0099
## 2017-06-27 02:30:33 - epoch 10, expected cost 0.0087
## 2017-06-27 02:30:34 - epoch 11, expected cost 0.0078
## 2017-06-27 02:30:35 - epoch 12, expected cost 0.0070
## 2017-06-27 02:30:36 - epoch 13, expected cost 0.0063
## 2017-06-27 02:30:37 - epoch 14, expected cost 0.0058
## 2017-06-27 02:30:38 - epoch 15, expected cost 0.0053
## 2017-06-27 02:30:39 - epoch 16, expected cost 0.0048
## 2017-06-27 02:30:40 - epoch 17, expected cost 0.0045
## 2017-06-27 02:30:41 - epoch 18, expected cost 0.0041
## 2017-06-27 02:30:42 - epoch 19, expected cost 0.0038
## 2017-06-27 02:30:43 - epoch 20, expected cost 0.0036
word_vectors <- glove$get_word_vectors()
unknown <- word_vectors["heart", , drop = FALSE] -
word_vectors["left", , drop = FALSE] +
word_vectors["right", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = unknown, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)
## heart right rate fem transplant RISS.
## 0.7030619 0.3826807 0.3258327 0.3023557 0.2654757 0.2418842
## 700x10 tolerate 48 130'S
## 0.2357185 0.2250517 0.2236517 0.2235856
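As the exercise suggests, you can re-fit the model with different hyperparameters and compare the neighbors you get. The values below (wider window, smaller vectors, more iterations) are just one setting to try, not a recommendation:
# rebuild the vectorizer/TCM with a wider context window and re-fit GloVe
vectorizer2 <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 10)
tcm2 <- create_tcm(it, vectorizer2)
glove2 <- GlobalVectors$new(word_vectors_size = 100, vocabulary = vocab, x_max = 10)
glove2$fit(tcm2, n_iter = 30)
word_vectors2 <- glove2$get_word_vectors()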
Thank you for checking out the second workshop. We hope that you are now more familiar with language modeling and deep learning using R.