Searching for Feelings — An Intern Works on Topic and Sentiment Analysis Source: Isaac Pena
Despite the prestige of an internship with The New York Times, when I signed on to do software development with the Search team, before I had even stepped into the building, I assumed I’d be working on small projects, fixing whatever bugs needed fixing, and coding only as much as was strictly required. I was perfectly happy to do so, of course, but I fully expected to receive piecemeal assignments to be completed in small fragments of time.
As things turned out, I’m writing this while my code is compiling. It’s been about 25 minutes so far. As orientation to the internship program was phasing out at the beginning of June, I was greeted by the Search team and almost immediately given a swath of personal projects to choose from — individually-directed, team-assisted tasks that I could spend the full ten-week period of the internship on, with the end goal of leaving the team (and indeed, The Times at large) with a complete tool they could actually integrate and use in the search engine.
For many years, I’ve had a pretty serious interest in linguistics, but I never had the chance to do much with it save for taking a few linguistics classes at college. I always wondered if I could apply the skills I was learning as a computer science major to my academic hobby, but the two fields never seemed to cross while I was at school. So, when I was asked which of a selection of projects I wanted to pursue this summer, one prompt — linguistic sentiment analysis of articles’ subjects — stood out.
That’s how I came to be waiting 25 minutes for my code to compile. The initial idea for the semantic analyzer has come to fruition through the use of existing linguistic tools. The Search team suggested exploring Google’s recently released SyntaxNet parser, a neural network pre-trained on a massive syntactic corpus that can read new sentences, break them down into their constituent parts, and explain how those constituents are related. Additionally, the project uses vaderSentiment, a Python tool out of Georgia Tech which, drawing on website comments and user posts on Twitter, can classify the overall sentiment of a snippet of text as positive or negative.
My project has a fairly simple goal: to step through any given article published by The New York Times and to return a list of its major subjects or topics, along with whether the article relays positive or negative sentiment about each of them. At a high level, I’ve used Clojure, a Lisp dialect that runs on the JVM, to construct a pipeline from the article itself through SyntaxNet into vaderSentiment, the output of which is the list of subjects and their positive or negative bent. This output can be integrated into the Times search engine: readers interested in seeing the latest uplifting news on some subject will be able to search with a positive filter, and those who want to read more sobering news about a given figure will be able to search with a negative filter.
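As a rough illustration of the shape of that pipeline (not the project's actual Clojure code), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the subjects are passed in as a plain list rather than extracted by SyntaxNet, a tiny made-up lexicon stands in for vaderSentiment, the function names are hypothetical, and the 0.05 cutoff for calling a score positive or negative is an arbitrary choice.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical toy lexicon; a stand-in for a real sentiment model.
TOY_LEXICON = {
    "praised": 1.0, "wins": 1.0, "uplifting": 1.0,
    "criticized": -1.0, "fails": -1.0, "scandal": -1.0,
}


def toy_sentiment(sentence: str) -> float:
    """Average lexicon score of the known words in a sentence (0.0 if none)."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def topic_sentiment(
    article: str,
    subjects: List[str],
    score: Callable[[str], float] = toy_sentiment,
) -> Dict[str, float]:
    """Map each subject to the mean score of the sentences that mention it."""
    scores = defaultdict(list)
    for sentence in split_sentences(article):
        lowered = sentence.lower()
        for subject in subjects:
            if subject.lower() in lowered:
                scores[subject].append(score(sentence))
    return {subject: sum(v) / len(v) for subject, v in scores.items()}


def label(score: float, threshold: float = 0.05) -> str:
    """Bucket a score into positive / negative / neutral (threshold is assumed)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def filter_topics(results: Dict[str, float], wanted: str) -> List[str]:
    """Return the topics whose label matches the requested search filter."""
    return sorted(t for t, s in results.items() if label(s) == wanted)


if __name__ == "__main__":
    article = (
        "The mayor praised the new park. "
        "The senator criticized the budget. The budget fails again."
    )
    results = topic_sentiment(article, ["mayor", "budget", "park"])
    print(results)
    print(filter_topics(results, "positive"))
```

Because the scorer is just a function argument, a real model could be swapped in without touching the rest of the sketch. With the vaderSentiment package installed, for instance, one option would be to pass `lambda s: SentimentIntensityAnalyzer().polarity_scores(s)["compound"]` as `score`.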
The project has, thus far, been an incredible opportunity to work with cutting-edge programs in the field of computational linguistics, which I’ve never had the chance to study in school, and to build a library of tools that The Times can implement as it sees fit. I hope to walk away from the experience with a new swath of knowledge and experience in this technology, and I hope that The New York Times finds good reason to deploy my project. But above all, I hope that it eventually provides value for the readership of The Times, and that this additional search feature improves our ability to understand our entire text corpus.
Isaac Pena is a summer intern on the Search team. He will return to Yale University in the fall as a junior in computer science.