Measuring the Comprehension Burden of Text Content

A Case Study

What is the business problem?

Teachers and educators have long struggled with large classes of students at varying academic levels. Not only is it difficult to allocate classroom time appropriately across students of all reading levels, but it is also challenging to find enough reading material to accommodate them all. Although a loose classification system labels some literature by grade level, it is common for students in a class to have reading abilities that fall above or below their designated grade level. It therefore does not suffice for a 5th grade English teacher to assign all of her students Katherine Paterson’s Bridge to Terabithia, for example. To ease the teacher’s burden of providing individually matched reading materials, a machine learning algorithm can be used to classify texts by reading level.

How can Machine Learning help?

Text mining and natural language processing (NLP) are often used to classify documents (e.g., classifying news articles by topic). Assessing the reading comprehension level that a document demands of its audience is more challenging. A number of readability scores attempt to quantify reading level. These scores are calculated by assigning each word a difficulty value that measures how rare and sophisticated the word is. By summing the difficulty values of all words in a document, we obtain a single score for the document. We can also analyze how many conjunctions, pronouns, subordinate clauses and other grammatical features the document contains.
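As a rough illustration of the word-level scoring described above, the sketch below sums per-word difficulty values and counts a few function words. The difficulty table, the default value for unknown words, and the small pronoun and conjunction lists are all hypothetical placeholders; a real system would likely derive difficulty from a large word-frequency list.

```python
import re

# Hypothetical difficulty values (higher = rarer / more sophisticated).
WORD_DIFFICULTY = {
    "the": 1, "cat": 1, "sat": 1, "on": 1, "mat": 2,
    "ubiquitous": 8, "ephemeral": 9,
}
DEFAULT_DIFFICULTY = 5  # assumed value for words missing from the table

# Tiny illustrative word lists, not a complete grammar.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
CONJUNCTIONS = {"and", "but", "or", "because", "although"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_score(text):
    """Sum the difficulty values of every word in the document."""
    return sum(WORD_DIFFICULTY.get(w, DEFAULT_DIFFICULTY) for w in tokenize(text))


def count_in(text, vocabulary):
    """Count how many tokens of the document belong to the given word set."""
    return sum(1 for w in tokenize(text) if w in vocabulary)
```

For example, document_score("The cat sat on the mat.") sums 1 + 1 + 1 + 1 + 1 + 2 and returns 7. Because a plain sum grows with document length, dividing by the number of tokens may be a sensible variant when comparing texts of different sizes.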

What have we done at Numtra?

Gephi, a graph visualization tool, was first used to model the similarity of documents in the corpus. Documents by the same author are clustered together, and these clusters are more closely connected to other clusters of literature read, “liked” or purchased by the same reader group. Amazon, for example, recommends that Harry Potter fans buy Marc Secchia’s Dragonfriends. Our Gephi graph would show J.K. Rowling’s cluster of books adjacent to Marc Secchia’s.
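One way such a graph could be prepared is sketched below: books become nodes, and two books are linked when they share readers, with the number of shared readers as the edge weight. The function name and the input shape (a mapping from book title to a set of reader IDs) are assumptions made for illustration, not the pipeline actually used. The resulting triples could be written to a CSV edge list for import into Gephi.

```python
from itertools import combinations


def co_reader_edges(readers_by_book):
    """Return (book_a, book_b, shared_reader_count) for every pair of
    books that has at least one reader in common."""
    edges = []
    for a, b in combinations(sorted(readers_by_book), 2):
        shared = len(readers_by_book[a] & readers_by_book[b])
        if shared:
            edges.append((a, b, shared))
    return edges


# Hypothetical data: reader IDs per book.
readers = {
    "Book A": {"r1", "r2"},
    "Book B": {"r2", "r3"},
    "Book C": {"r4"},
}
edges = co_reader_edges(readers)
```

Books with heavily overlapping readerships end up with heavier edges, so a force-directed layout would tend to draw them close together.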

After text was extracted from the documents in the corpus, machine learning techniques were used to build classification models. The Quadratyx NLP toolkit, which takes readability score into account, was used to build n-gram and skip-gram based models of the text content. Word associations were also studied. Documents with many complex word associations were found to perform better under models with a high number of n-grams. Term and token weighting algorithms were also used to give these complex word associations appropriate emphasis.
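The feature types mentioned above can be sketched in a few lines of plain Python. This is not the Quadratyx toolkit's API, which the text does not describe; the function names are hypothetical, and TF-IDF is used here only as one common example of a term weighting scheme.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-token windows, e.g. n=2 gives bigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Ordered token pairs with at most max_skip tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs


def tf_idf(docs):
    """docs is a list of feature lists (words, n-grams or skip-grams).
    Returns one {feature: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(f for doc in docs for f in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            f: (c / len(doc)) * math.log(n_docs / doc_freq[f])
            for f, c in counts.items()
        })
    return weights
```

With max_skip set to 0, skip_bigrams produces the same pairs as ngrams with n=2; larger values capture associations between words that are near each other but not adjacent. Under this weighting, a feature that appears in every document receives a weight of zero, while features concentrated in a few documents are emphasized.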

Ultimately, the literature was classified by reading level, a pseudo-binning of the readability score. Students were then asked to take a standardized reading test to measure reading comprehension. The model outputs, which had been stored on a distributed file system, could then be easily accessed by teachers and educators to match individual students with a recommended reading list.
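A minimal sketch of the binning and matching step follows. The bin boundaries, the five-level scale, and the assumption that a student's test result has already been mapped onto the same levels are all hypothetical; the actual thresholds are not given in the text.

```python
from bisect import bisect_right

# Hypothetical bin edges: scores below 20 map to level 1,
# 20 up to (not including) 40 to level 2, and so on; 80 and above is level 5.
LEVEL_BOUNDS = [20, 40, 60, 80]


def reading_level(score):
    """Map a readability score to a discrete reading level (1 to 5)."""
    return bisect_right(LEVEL_BOUNDS, score) + 1


def recommend(student_level, library):
    """Return the titles whose reading level matches the student's level.
    library maps each title to its readability score."""
    return sorted(
        title for title, score in library.items()
        if reading_level(score) == student_level
    )
```

For instance, with a library of {"A": 25, "B": 70, "C": 39.5}, a student at level 2 would be matched with titles "A" and "C". A more forgiving variant might also include titles one level above or below the student's level.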