How to tackle a real-world problem with GuidedLDA

Shahrzad Hosseini
Published in Insight
Nov 14, 2019

Snapshot of interactive visualization of the topics identified by Guided LDA and the keywords in each topic (pyLDAvis)

Originally posted on Analytics Vidhya.

The prevalence of online platforms for interaction, and the sheer volume of text that users generate on them, make digesting that data increasingly time-consuming. Sown to Grow is an online educational company that aims to empower students by providing a platform for setting goals, reflecting on learning strategies, and interacting with their teachers. For the company to scale up across the US, automated parsing of reflections is necessary: it helps teachers customize their feedback and channel limited resources to the most vulnerable kids.

Data

The company shared 180k students’ reflections that, based on the company’s rubric system, were considered high quality (having a strategy/strategies). The actual data cannot be shown due to privacy reasons, but my dataframe looked like this:

        content
index
0       reflection 0
1       reflection 1
...
184835  reflection 184835

After cleaning the data, which included removing duplicates, unrelated content, and non-English content, I ended up with 104k reflections that I used to identify the strategies. Below is the function I used to correct misspelled words:

from enchant.checker import SpellChecker

def spell_check(text):
    '''
    spell_check: function for correcting the spelling of the reflections
    Expects: a string
    Returns: the corrected string
    '''
    Corr_RF = []
    # Grab each individual word of the reflection
    for refl in text.split():
        # Check to see if the words are in the dictionary
        chkr = SpellChecker("en_US", refl)
        for err in chkr:
            # For the identified errors or words not in the dictionary, get the
            # first suggested correction and replace it in the reflection string
            if len(err.suggest()) > 0:
                sug = err.suggest()[0]
                err.replace(sug)
        Corr_RF.append(chkr.get_text())
    # Return the corrected reflection as a single string
    return ' '.join(Corr_RF)

data['Corrected_content'] = data.content.apply(spell_check)
documents = data  # rename the dataframe to documents

To remove the non-English content, I used langdetect to tag the language of each reflection and drop those that were not in English. Langdetect is pretty accurate when the input is a full sentence, but less so when given just a single word.

from langdetect import detect

def lang_detect(texts):
    '''
    lang_detect: function for detecting the language of the reflections
    Expects: an iterable of strings (one reflection per item)
    Returns: a list of the detected languages
    '''
    lang = []
    for refl in texts:
        lang.append(detect(refl))
    return lang
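As a rough sketch of how the detected languages can then be used to keep only the English rows (the 'language' column name is my own, not from the original code):

documents['language'] = lang_detect(documents['content'])
documents = documents[documents['language'] == 'en'].reset_index(drop=True)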

Initial strategies to solve the problem

Regular LDA

I began by modeling topics in the reflections with Latent Dirichlet Allocation (LDA) using the Gensim topic modeling package. To prepare the data for topic modeling, I tokenized the text (split the document into sentences and sentences into words), removed punctuation, and lower-cased everything. Words of three characters or fewer were also removed. These steps were all done with Gensim's simple preprocessing module. I then defined a function to change words from third person to first person and verbs from past and future tenses to present tense (lemmatization). Following that, words were reduced to their root form (stemming).

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from nltk.corpus import wordnet
import numpy as np

np.random.seed(42)

After importing the necessary packages and modules, it was time for some preprocessing as explained before:

def lemmatize_stemming(text):
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

processed_docs = documents['content'].map(preprocess)

The example below shows the result of preprocessing (I have used a hypothetical example):

doc_sample = documents[documents['index'] == 34].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)

print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

Example:

original document:
['Something', 'I', 'think', 'I', 'have', 'done', 'correct', 'is', 'studying', 'in', 'advance.']

tokenized and lemmatized document:
['think', 'correct', 'studi', 'advanc']

To create a bag of words for the dataset, Gensim's Dictionary can be used. The dictionary is built from 'processed_docs' and maps each unique word to an id; the bag-of-words representation then stores the number of times each word appears (its word count) in each document of the corpus.

dictionary = gensim.corpora.Dictionary(processed_docs)

count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

Remove the tokens that appear in fewer than 15 documents or in more than 50% of the documents (a fraction of the corpus, not an absolute count). After that, keep only the 100,000 most frequent tokens.

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

For each reflection, I then created a list of (token id, word count) pairs showing how many times each dictionary word appears in that document, and saved the result as bow_corpus:

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Now the data is ready for the LDA topic model. I used Gensim's multicore LDA implementation, which can run on several cores in parallel.

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=7, id2word=dictionary, passes=2, workers=2)

Check the words for each topic and their relative weights:

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0
Words: 0.046*"time" + 0.044*"read" + 0.041*"week" + 0.030*"work" + 0.024*"studi" + 0.022*"go" + 0.016*"good" + 0.016*"book" + 0.015*"like" + 0.014*"test"
Topic: 1
Words: 0.055*"read" + 0.036*"question" + 0.034*"answer" + 0.025*"time" + 0.018*"text" + 0.017*"strategi" + 0.017*"work" + 0.016*"think" + 0.014*"go" + 0.014*"look"
Topic: 2
Words: 0.037*"need" + 0.021*"work" + 0.018*"word" + 0.018*"write" + 0.015*"time" + 0.015*"complet" + 0.015*"essay" + 0.014*"goal" + 0.013*"help" + 0.012*"finish"
Topic: 3
Words: 0.042*"note" + 0.041*"help" + 0.032*"studi" + 0.029*"understand" + 0.027*"quiz" + 0.024*"question" + 0.021*"time" + 0.016*"better" + 0.014*"take" + 0.014*"test"
Topic: 4
Words: 0.031*"write" + 0.031*"work" + 0.027*"time" + 0.025*"think" + 0.024*"sure" + 0.019*"check" + 0.017*"thing" + 0.017*"strategi" + 0.014*"question" + 0.014*"help"
Topic: 5
Words: 0.058*"work" + 0.057*"grade" + 0.046*"goal" + 0.033*"class" + 0.027*"week" + 0.022*"math" + 0.017*"scienc" + 0.016*"improv" + 0.016*"want" + 0.016*"finish"

As you can see, many words are shared between topics, and there is no distinct theme that can be assigned to each group of words.

Part of Speech (POS) Tagging

After LDA, I decided to tag the part of speech (POS) of each word in the reflections and extract the verbs. I assumed students were reflecting on what they did, so reflections containing verbs in the past tense could give me a clue about the learning-strategy topics (e.g. "I studied my notes and practiced the past exams"). I parsed the reflections and extracted all of the verbs via POS tagging. Then I looked at the verb tenses to see whether reflections containing a learning strategy tended to use the past tense. I noticed, however, that many reflections clearly describe learning strategies without being written in the past tense.
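As an illustration of the kind of verb extraction described above, here is a minimal sketch using NLTK's pos_tag (the helper name extract_verbs is mine, not from the original code):

import nltk
from nltk import pos_tag, word_tokenize

def extract_verbs(text):
    # Tag every token and keep only the verb tags (VB, VBD, VBG, VBN, VBP, VBZ)
    tagged = pos_tag(word_tokenize(text))
    return [(word, tag) for word, tag in tagged if tag.startswith('VB')]

extract_verbs('I studied my notes and practiced the past exams')
# e.g. [('studied', 'VBD'), ('practiced', 'VBD')]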

Pipeline used for solving this problem

Guided LDA

This did not help me find distinct learning-strategy topics either. However, both LDA and POS tagging gave me the idea to use GuidedLDA (Github repo), a semi-supervised learning algorithm. The idea is to set some seed words for topics that the user believes are representative of the underlying topics in the corpus, and to guide the model to converge around those terms. I used a Python implementation of the algorithm explained in the paper by J. Jagarlamudi, H. Daume III and R. Udupa, "Incorporating Lexical Priors into Topic Models." The paper describes how priors (in this case, seed words) can be set in the model to guide it in a certain direction.

In regular LDA, each word is first randomly assigned to a topic, controlled by Dirichlet priors via the alpha parameter (now you know where LDA gets its name). The next step is to work out which term belongs to which topic. LDA uses a very simple approach: it finds the topic for one term at a time.

Let's assume we want to find the topic for the word 'study'. LDA first assumes that every other word in the corpus is assigned to the right topic: after the initial step, words are distributed uniformly across topics, and those assignments are treated as correct for now. LDA then computes which words 'study' is frequently paired with, and which topic is most common among those terms. 'Study' is assigned to that topic, so it will probably end up near whichever topic 'textbook' and 'notes' are in; these three words are now closer to each other than they were before this step. The model then moves on to the next word and repeats the process as many times as needed to converge. With GuidedLDA, we explicitly want the model to converge in such a way that 'study' and 'textbook' land in the same topic. To do so, GuidedLDA gives 'study' and 'textbook' an extra boost toward a specific topic. How much extra boost a seed word receives is controlled by seed_confidence, which can range between 0 and 1. With a seed_confidence of 0.1, the seeded words are biased 10% more towards the seeded topics.

To use the Python implementation of GuidedLDA, you can either install it via pip:

pip install guidedlda

or build it from source:

git clone https://github.com/vi3k6i5/GuidedLDA
cd GuidedLDA
sh build_dist.sh
python setup.py sdist
pip install -e .

Start GuidedLDA by preprocessing the data, as you do with any NLP work. For that, I defined my own preprocessing functions:

import re
import nltk
from nltk.tokenize import word_tokenize

def get_wordnet_pos(word):
    '''tags parts of speech to tokens
    Expects a string and outputs the corresponding
    WordNet part of speech'''
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def word_lemmatizer(text):
    '''lemmatizes the tokens based on their part of speech'''
    lemmatizer = WordNetLemmatizer()
    text = lemmatizer.lemmatize(text, get_wordnet_pos(text))
    return text

def reflection_tokenizer(text):
    '''expects a string and returns a list of lemmatized tokens
    with the stop words removed. Tokens are lower-cased, and
    non-alphanumeric characters as well as numbers are removed.'''
    text = re.sub(r'[\W_]+', ' ', text)  # keeps alphanumeric characters
    text = re.sub(r'\d+', '', text)      # removes numbers
    text = text.lower()
    tokens = [word for word in word_tokenize(text)]
    tokens = [word for word in tokens if len(word) >= 3]  # removes tokens shorter than 3 characters
    tokens = [word_lemmatizer(w) for w in tokens]
    tokens = [s for s in tokens if s not in stop_words]  # stop_words is my custom stop word list
    return tokens

After defining all the necessary preprocessing functions, it is time to apply them to the target column of the dataframe (here, corrected_content) and save the result as a new column, 'lemmatize_token'.

df['lemmatize_token'] = df.corrected_content.apply(reflection_tokenizer)

Next, to generate the term-document matrix, I used the CountVectorizer class from the scikit-learn package:

from sklearn.feature_extraction.text import CountVectorizer

First, we need to instantiate CountVectorizer. For the full list of parameters, you can refer to the scikit-learn documentation. I set the tokenizer to the customized one defined above, set the stop words to the list I had created based on my own dataset, and used an n-gram range of 1 to 4 words. Now it is time to fit and transform the corpus to generate the term-document matrix:

token_vectorizer = CountVectorizer(tokenizer=reflection_tokenizer, min_df=10, stop_words=stop_words, ngram_range=(1, 4))

X = token_vectorizer.fit_transform(df.corrected_content)

To model the topics with GuidedLDA, the package is imported and a dictionary mapping each term in the vocabulary to its column index is created.

import guidedlda

tf_feature_names = token_vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(tf_feature_names))

I then provided a list of seed words to the model, drawing on the semantics of the text along with the initial keywords I got from LDA and the dictionary of verbs from POS tagging. For that, I created a list of lists in which each inner list contains the keywords I wanted grouped under a specific topic.

seed_topic_list = [
    ['take', 'note', 'compare', 'classmate', 'highlight', 'underline', 'jot', 'write', 'topic', 'main', 'complete', 'point', 'copy', 'slide'],
    ['read', 'study', 'review', 'skim', 'textbook', 'compare', 'note', 'connect', 'sketch', 'summarize', 'relationship', 'map', 'concept', 'diagram', 'chart'],
    ['question', 'essay', 'assignment', 'exam', 'test', 'quiz', 'answer', 'practice', 'review', 'repeat', 'strength', 'weak', 'solve', 'problem', 'identify'],
    ['plan', 'calendar', 'time', 'task', 'list', 'manage', 'procrastinate', 'due', 'stress', 'manage', 'anxiety', 'express', 'break', 'sleep', 'nap', 'eat', 'exercise'],
    ['group', 'partner', 'classmate', 'brainstorm', 'ask', 'answer', 'verify', 'peer', 'teach', 'clarify'],
    ['ask', 'aid', 'resource', 'teacher', 'tutor', 'peer', 'verify', 'explain', 'clear', 'talk']
]

As you can see, I provided the model with seed words for 6 topics.

model = guidedlda.GuidedLDA(n_topics=6, n_iter=100, random_state=7, refresh=10)

seed_topics = {}
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        # assumes every seed word is present in the CountVectorizer vocabulary
        seed_topics[word2id[word]] = t_id

model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

Checking the top words for each topic:

n_top_words = 15
topic_word = model.topic_word_

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(tf_feature_names)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

The results look like this:

Topic 0: write time reading book know essay start idea take people read keep focus first complete
Topic 1: read study time note take test reading quiz question book look understand day word review
Topic 2: question time study quiz understand check problem note answer knowledge take practice ask mistake learn
Topic 3: time finish assignment homework complete study reflection school day test quiz home keep win last
Topic 4: question answer time read look text reading evidence write find understand word know back right
Topic 5: ask finish teacher talk time school stay attention pay focus extra test pay attention homework know

To visualize the results, I used the pyLDAvis package's powerful interactive visualization; the result is shown below, and a sketch of how the visualization can be produced follows the list. The six topics are distinctly separated, and the theme of each topic can be summarized as:

1. Finish homework/complete assignments

2. Check past quizzes and questions/understand answers

3. Talk to and ask the teacher/pay attention

4. Read/study notes and books

5. Answer questions and learn from problems

6. Write stories, essays and books
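For reference, here is a rough sketch of how such an interactive panel can be generated with pyLDAvis' model-agnostic prepare function (variable names follow the code above; the output file name is my own choice):

import numpy as np
import pyLDAvis

# Document-topic distributions from the fitted GuidedLDA model
doc_topic = model.transform(X)

# Document lengths and corpus-wide term frequencies from the term-document matrix
doc_lengths = np.asarray(X.sum(axis=1)).ravel()
term_frequency = np.asarray(X.sum(axis=0)).ravel()

panel = pyLDAvis.prepare(
    topic_term_dists=model.topic_word_,
    doc_topic_dists=doc_topic,
    doc_lengths=doc_lengths,
    vocab=tf_feature_names,
    term_frequency=term_frequency
)
pyLDAvis.save_html(panel, 'guidedlda_topics.html')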

The source code can be found in the GitHub repo. I look forward to hearing any feedback or questions.

Shahrzad Hosseini developed Reflectometer during her time as an Insight Data Science Fellow in 2019.

Are you interested in working on high-impact projects and transitioning to a career in data? Sign up to learn more about the Insight Fellows programs and start your application today.
