In last Sunday’s New York Times Book Review (June 26, 2011), Kathryn Schulz published an article, “Distant Reading,” covering the Stanford Literary Lab’s initiative to feed novels into computer programs to determine their genre. Crazy, you say? The lab’s founder, Franco Moretti, has an interesting premise: given any literary genre, you can’t possibly read all of its books to develop an appreciation of its underlying traits. George Eliot, for instance, wrote Victorian fiction, and Moretti notes that some 60,000 novels were published in 19th-century England. You can’t read them all; there are just too many books and not enough time. So Moretti advocates what he calls distant reading. Recently his team fed 30 novels into two computer programs, then asked the programs to identify six additional works. The programs succeeded, using techniques completely different from those a human reader uses: word counting and grammatical/semantic signals. The lab has also embarked on another project that uses social network analysis to study plot development, similar to how HP’s Visual Intelligence Explorer uses cloned code patterns to understand clusters of application code.
The article caught my interest because over the past few months I’ve been working from a similar premise in legacy source code analysis. Module by module, I count keywords and categorize them, statistically reduce the counts, and feed the results into a machine learning tool.
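To make the counting step concrete, here is a minimal sketch in Python. The keyword groups and the COBOL-flavored sample module are hypothetical stand-ins; the actual taxonomy and reduction used in the analysis aren’t published here.

```python
import re
from collections import Counter

# Hypothetical keyword groups for COBOL-style legacy code; the real
# categorization scheme is an assumption for illustration only.
KEYWORD_GROUPS = {
    "io":     {"READ", "WRITE", "OPEN", "CLOSE"},
    "report": {"DISPLAY", "REPORT", "PRINT"},
    "data":   {"MOVE", "COMPUTE", "ADD"},
}

def keyword_profile(source: str) -> dict:
    """Count keywords in one module, roll them up by category, and
    normalize to frequencies so modules of different sizes compare."""
    tokens = Counter(re.findall(r"[A-Z-]+", source.upper()))
    totals = {g: sum(tokens[k] for k in kws)
              for g, kws in KEYWORD_GROUPS.items()}
    grand = sum(totals.values()) or 1  # avoid division by zero
    return {g: n / grand for g, n in totals.items()}

# A toy "module": heavy on I/O verbs, with some report output.
module = "OPEN INPUT MASTER. READ MASTER. DISPLAY REPORT-LINE. WRITE PRINT-REC."
print(keyword_profile(module))  # e.g. {'io': 0.75, 'report': 0.25, 'data': 0.0}
```

Normalizing the counts into a profile is one simple form of the statistical reduction: each module becomes a small fixed-length vector the learner can compare.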
First, legacy experts categorize a set of modules; that is the learning part of the process. Then the machine learner looks for patterns, matching the experts’ judgment using signals they likely never noticed, and produces categorization rules. Not for literary genre, but for module functionality: reporting, data integration, user interface. And it gets smarter the more it reads. The rules are then used to categorize more modules. It’s an iterative process.
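The expert-seeded loop can be sketched as a nearest-centroid classifier over keyword profiles. This is only illustrative: the actual learner and its rules aren’t described, and the seed profiles below are invented examples.

```python
# Each module is a dict of keyword-category frequencies (e.g. io/report/data).

def train(labeled):
    """labeled: list of (profile, category) pairs judged by legacy experts.
    Returns one centroid profile per category -- the learned 'rules'."""
    by_cat = {}
    for profile, cat in labeled:
        by_cat.setdefault(cat, []).append(profile)
    return {cat: {k: sum(p[k] for p in ps) / len(ps) for k in ps[0]}
            for cat, ps in by_cat.items()}

def classify(profile, centroids):
    """Assign the category whose centroid is nearest (squared distance)."""
    def dist(c):
        return sum((profile[k] - c[k]) ** 2 for k in c)
    return min(centroids, key=lambda cat: dist(centroids[cat]))

# Hypothetical expert-labeled seed set.
seed = [
    ({"io": 0.2, "report": 0.7, "data": 0.1}, "reporting"),
    ({"io": 0.6, "report": 0.1, "data": 0.3}, "data integration"),
]
rules = train(seed)
new_module = {"io": 0.1, "report": 0.8, "data": 0.1}
print(classify(new_module, rules))  # "reporting"
# Iteration: confident classifications can be reviewed and fed back
# into train() as additional labeled examples on the next pass.
```

The iterative part is the last step: each pass grows the labeled set, so the rules sharpen as the learner reads more modules.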
Here’s the point: you should know your application’s functional composition before making transformation decisions, and automated classification makes that process easier. Why reengineer COBOL into Java if it’s just producing reports? Use a reporting framework and get the agility you need with a more attractive ROI. The same goes for ETL, data integration, and an array of other functionality.
This is a process we discuss in HP's Application Transformation Experience Workshop.