Discovery Files

Growing field of natural language processing works at the interface of computers and language

Researcher's algorithm design produces valuable tools for translation, text mining, question answering and more

Most of us never actually bother to read text labeled "terms and conditions," "privacy policy," or "financial prospectus." We usually just check the "accept" box, assume all is well, and move on. Still, wouldn't it be nice to have a computer program that could read them for us, and alert us to any problems?

This could ultimately be among the benefits of natural language processing, a field that works at the interface of computers and language. The goal of this growing field, which dates back to the 1950s, is to enable computers to glean meaning from language through the use of automated algorithms that process linguistic data from text.

"Imagine a computer program that reads text and interprets it to do something useful, but not the way you and I would do," says Noah Smith, an associate professor at Carnegie Mellon University's School of Computer Science. "This program could read text that people don't like to read, or don't have time to read."

Smith designs algorithms that analyze human language. The research produces valuable software tools for translation, text mining, question answering and information extraction, as well as "scientific discovery wherever text serves as data," he says, for example in sociolinguistics, political science or economics.

The National Science Foundation (NSF)-funded scientist specifically is studying computational models for natural language parsing and semantic analysis, that is, to "take a sentence and try to figure out what it means," Smith says. "Doing this could help with certain tasks that people cannot do easily, such as extracting information out of very large collections, and would rather delegate it to a computer that will perform consistently and without getting tired."

It is roughly the automated equivalent of the sentence diagramming that many of today's adults did in elementary school. "Our programs analyze sentences into deeper linguistic structures," Smith says.
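To make the idea of automated sentence diagramming concrete, here is a minimal sketch of a chart parser (the textbook CKY algorithm) in Python. It is not Smith's software or method. The grammar rules, the word list and the function names `parse` and `bracket` are all invented for illustration, and real parsers learn far larger grammars from data.

```python
from collections import defaultdict

# Hypothetical toy grammar, invented for this example (not from Smith's research).
# Each entry says: a left part followed by a right part forms the parent phrase.
BINARY_RULES = {
    ("NP", "VP"): "S",    # sentence = noun phrase + verb phrase
    ("Det", "N"): "NP",   # noun phrase = determiner + noun
    ("V", "NP"): "VP",    # verb phrase = verb + noun phrase
}

# Hypothetical word list mapping each known word to its part of speech.
LEXICON = {
    "the": "Det",
    "a": "Det",
    "program": "N",
    "contract": "N",
    "reads": "V",
}


def parse(sentence):
    """Return a nested-tuple parse tree rooted at S, or None if no parse exists."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)] maps a phrase label to one tree covering words[i:j].
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        tag = LEXICON.get(word)
        if tag is None:
            return None  # unknown word: this toy grammar cannot analyze it
        chart[(i, i + 1)][tag] = (tag, word)
    # Build longer spans from shorter ones, trying every split point.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for left_label, left_tree in chart[(start, split)].items():
                    for right_label, right_tree in chart[(split, end)].items():
                        parent = BINARY_RULES.get((left_label, right_label))
                        if parent is not None and parent not in chart[(start, end)]:
                            chart[(start, end)][parent] = (parent, left_tree, right_tree)
    return chart[(0, n)].get("S")


def bracket(tree):
    """Format a parse tree as a bracketed string."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return f"({tree[0]} {tree[1]})"
    children = " ".join(bracket(child) for child in tree[1:])
    return f"({tree[0]} {children})"


print(bracket(parse("The program reads a contract")))
# prints: (S (NP (Det the) (N program)) (VP (V reads) (NP (Det a) (N contract))))
```

The bracketed output plays the role of the classroom diagram: it groups "the program" as the subject and "reads a contract" as the predicate. A word order the toy grammar does not cover, such as "reads the the", yields no parse.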

To be sure, these programs have a long way to go before attaining human-like reading and comprehension skills, he says. However, "humans are not always perfect at these things," he says. "People do have cognitive limitations, or limitations on memory, or they get distracted. Computer programs have complementary strengths.

"The public perception of this artificial intelligence branch of computer science is often colored by fears that robots might replace us," he adds. "We are, in fact, trying to make tools that will make life easier and less tedious for people."

Smith is conducting his research under an NSF Faculty Early Career Development (CAREER) award, which he received in 2011. The award supports junior faculty who exemplify the role of teacher-scholars through outstanding research, excellent education and the integration of education and research within the context of the mission of their organization.

Smith also has applied his methods to problems like measuring public opinion as expressed in social media messages, and understanding the underlying ecosystems of companies by scrutinizing their press releases. The latter could benefit economists who study companies, specifically how companies interact with government or with each other.
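As a rough illustration of what measuring opinion in social media messages can involve, the Python sketch below counts how many messages contain a word from a small positive list and how many contain a word from a small negative list, then reports the ratio. This is a deliberately simple word-counting approach, not a description of Smith's statistical methods. The word lists, the sample messages and the function names are assumptions made up for the example.

```python
import re

# Hypothetical opinion word lists, invented for this example.
POSITIVE = {"good", "great", "love", "hope"}
NEGATIVE = {"bad", "awful", "hate", "worry"}


def tokenize(message):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z']+", message.lower())


def sentiment_ratio(messages):
    """Return (# messages with a positive word) / (# messages with a negative word).

    Returns None when no message contains a negative word, since the ratio
    is undefined in that case.
    """
    positive_count = sum(1 for m in messages if POSITIVE & set(tokenize(m)))
    negative_count = sum(1 for m in messages if NEGATIVE & set(tokenize(m)))
    if negative_count == 0:
        return None
    return positive_count / negative_count


# Made-up sample messages, standing in for one day of posts on a topic.
sample = [
    "I love this plan",
    "Awful idea, I hate it",
    "Great news, but I worry",
    "No opinion here",
]
print(sentiment_ratio(sample))  # prints: 1.0
```

A ratio above 1 would suggest more positive than negative messages on that day. Tracking the ratio over time is one simple way to turn a stream of text into a number that can be compared with other data, although word counting of this kind misses sarcasm, negation and context.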

As for social media, "text exists in a larger social context," he says. "People don't create language purely because they feel like it. Social media, like Twitter and Facebook, are intended to be read by people we know, but often there is a much larger conversation that goes on. The advent of these platforms allows us to observe more of these interactions, and hopefully to develop statistical methods to better understand what people mean when they write or talk."

Educational activities as part of the grant include the development of new exercises and a competitive project within the undergraduate course on natural language processing. In past years, Smith has written problems for the Computational Linguistics Olympiad, a high school competition. Smith also is committed to tutoring and mentoring the graduate students he advises. His group calls itself "Noah's ARK."

"A lot of them interact with each other, including the junior doctoral students who can learn from the more senior ones," he says. "I try to promote opportunities for them to see all parts of my job, and work with me on not just research, but on service and teaching at the university and beyond."

Finally, his research group makes its software publicly available, not just for students in the classroom, but "so it can be used by other researchers and startup companies, as well as by regular people who want to try their hand at natural language processing," he says.