Sequential Short Text Classification Using Recurrent Neural Networks
Abstract: Short text classification is a long-standing problem in natural language processing, with applications ranging from question answering systems to sentiment classification. Recent approaches based on artificial neural networks have shown competitive results compared to traditional approaches such as support vector machines (SVMs) and maximum entropy. However, many existing systems for short text classification do not incorporate the context of the short texts. In this work, we present a system based on recurrent neural networks that sequentially classifies short texts based on their context. We evaluate our approach using the Dialog State Tracking Challenge 4 (DSTC 4) dataset, which consists of 35 dialogs annotated with speech acts for each utterance. Our preliminary results show that incorporating contextual information improves classification performance over non-contextual models, i.e., models that classify each sentence separately without using the context. In addition, we compare our approach with a baseline using SVM and logistic regression with context.
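A minimal sketch of the kind of contextual model described above, assuming each utterance is represented by mean-pooled word embeddings and an LSTM over the utterance sequence supplies the dialog context; all layer sizes, names, and data are illustrative rather than the system's actual configuration:

```python
import torch
import torch.nn as nn

class ContextualUtteranceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_acts=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.context_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_acts)

    def forward(self, dialog):
        # dialog: (num_utterances, max_words) word ids for a single dialog
        word_vecs = self.embed(dialog)                 # (U, W, E)
        utt_vecs = word_vecs.mean(dim=1).unsqueeze(0)  # (1, U, E) mean-pooled utterance vectors
        context, _ = self.context_rnn(utt_vecs)        # (1, U, H) context-aware utterance states
        return self.out(context.squeeze(0))            # (U, num_acts) speech-act logits per utterance

model = ContextualUtteranceClassifier(vocab_size=5000)
fake_dialog = torch.randint(1, 5000, (7, 12))          # 7 utterances of 12 tokens each
logits = model(fake_dialog)                            # every utterance is classified in context
```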
Modeling Recipe Steps Using Skip-Thought Vectors
Abstract: The generation of sequences of instructions remains an open problem in NLP. A large part of the challenge is to represent each step in a way that enables the construction of a robust model that supports operations like reordering and inference. To this end, we investigate the use of sentence-level embeddings known as skip-thought vectors in predicting the next and previous steps in a recipe. Trained on a corpus of 150k recipes with a total of 1.7m sentences, skip-thought vectors perform reasonably well as features for discriminating the next step, achieving an accuracy of 0.689 when given five candidate steps, but fare poorly when tasked with actually generating the next step. Additional experiments show that, while skip-thought vectors can be accurately decoded back into the encoded step, there may be limits to their representational power.
Identifying Entities And Their Relationships From Medical Articles
Abstract: This project aims to automatically extract the influence of various food items on different diseases from a large corpus of articles. Specifically, the problem setting is as follows. We have a collection of medical documents corresponding to the effect of particular food items (e.g. alcohol, dairy, meat, etc.) on different diseases (e.g. breast cancer, diabetes, etc.) in a specific population group (based for instance, on age or ethnicity or having some previous medical condition). The effect could, for instance, be positive (accelerating the onset), negative, or unknown. Using minimal supervision, our goal is to extract sentences from these articles that can identify the different entities, i.e. the specific population group, the food item, the medical disease or ailment, and the effect. We set up our problem in a neural setting and propose two approaches. The main idea is the following. Each sentence can be abstracted in terms of the words describing food, disease, population, effect, and the rest of the sentence. We learn a 5-block hidden representation of each sentence, where each block consists of several neural units. In the first approach, we associate a separate entity (food, disease, population, effect, other) with each block - we sample some training sentences, perturb them slightly by modifying some entity words, and do a block-constrained training by restricting the weight updates to only the units in the blocks that correspond to the modified entities. This hidden block layer is sandwiched between the input and output layers of an auto-encoder, and a subsequent training procedure tries to minimize the discrepancy between the input and the output. In the second approach, we do an end-to-end training using an architecture that contains a GRU encoder, the hidden block layer, and a GRU decoder, and measure the discrepancy between the input and output sentences directly. We form our corpus using the abstracts from the PubMed database. The initial results have been encouraging.
Understanding Shakespeare: Using Statistical Machine Translation To Translate The Plays Of Shakespeare Into Modern English
Abstract: The goal of this project was to translate the plays of Shakespeare into more modern English using statistical machine translation. One potential application of this is to use the output translation lattice to obtain a modern English translation in meter (e.g. iambic pentameter). First, a parallel corpus was created by scraping the No Fear Shakespeare website. The Berkeley aligner was used to generate word alignments; KenLM was used to generate the target language (modern English) language model (one using just the target text of the parallel corpus; one also using the English text from the Europarl corpus). Finally, phrase based statistical machine translation was performed using Phrasal. From the two language models used, BLEU scores of 21.769 and 21.598 were obtained respectively on the test data. Human evaluation of the test translations was also conducted. While sample size was small, subjects were generally able to distinguish machine from human translations, especially for longer or multi-sentence lines. The results of this project so far suggest that further work is needed to achieve accurate translations. One limitation of the generated corpus is that some sentence alignments are not one to one, but can be multi-sentence alignments, which makes statistical machine translation more difficult. One area for further work could be to further align the current corpus into individual sentence alignments. After more accurate translations are achieved, future work such as translating into meter could also be explored.
Towards Automatically Answering Video Game F.A.Q.’s
Abstract: A large portion of video game discussion online is initiated by players who are stuck and ask a forum for guidance. While this is effective for most players, the time spent waiting for a response is usually enough to force the player to stop playing, and as multiple players ask similar questions over time, it wastes the time of experienced users who must repeat their answers and sift through the "polluted" forum to find more advanced discussions. An alternate solution is to read a text walkthrough of the game, but searching such documents requires long linear scans or using "Ctrl-F" to locate exact word matches. NLP techniques can be used to marry the two solutions by taking in natural language questions and using the walkthroughs as a knowledge corpus to provide proposed answers to users and reduce forum pollution. In exploring this problem, I used a dataset of 8.4 million word-units worth of walkthroughs and question-answer pairs collected from http://www.gamefaqs.com/ for eight games of varying genres and data amounts. I compare against a basic baseline "Ctrl-F" algorithm, which takes a question Q and weights substring responses R by estimating the sum of P(w ∈ A | w ∈ Q) over w ∈ R ∩ Q. I improved upon the baseline with a "Segment-F" algorithm, which first segments the walkthroughs (which are mostly implicitly structured ASCII text) using a variant of a convolutional neural network trained by a semi-manual supervision method inspired by Prorogued Programming [ABS12]. The algorithms were evaluated by creating a fake walkthrough from the concatenated known answers and testing how often the question's true answer was chosen. "Ctrl-F" is surprisingly effective given its simplicity, achieving 22.61% average correct, but it was beaten by "Segment-F" with 25.93% correct and an extreme improvement on 3 of the games. These results demonstrate the promise of using automation to expedite the help process and reduce unnecessary forum posts.
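A rough sketch of the "Ctrl-F" baseline as it is described: estimate P(w ∈ A | w ∈ Q) from question-answer word sets, then score a candidate substring R by summing that probability over the words it shares with the question. Function and variable names below are mine, not the project's:

```python
from collections import Counter

def estimate_word_carryover(qa_pairs):
    # qa_pairs: iterable of (question_words, answer_words) sets from training data
    in_q, in_both = Counter(), Counter()
    for q_words, a_words in qa_pairs:
        for w in set(q_words):
            in_q[w] += 1
            if w in a_words:
                in_both[w] += 1
    return {w: in_both[w] / in_q[w] for w in in_q}   # estimate of P(w in A | w in Q)

def ctrl_f_score(question, candidate, carryover):
    return sum(carryover.get(w, 0.0) for w in set(question) & set(candidate))

# toy usage: pick the best-scoring walkthrough segment for a new question
carryover = estimate_word_carryover([({"how", "beat", "boss"}, {"use", "fire", "boss"})])
segments = [{"use", "fire", "on", "the", "boss"}, {"save", "the", "game"}]
best = max(segments, key=lambda r: ctrl_f_score({"how", "beat", "boss"}, r, carryover))
```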
Usage of Recurrent Neural Networks in Transition-based Dependency Parsing
Abstract: Transition-based parsing has the advantage of being very fast because it makes local decisions instead of trying to find a global optimum. Chen and Manning (2014) created a fast and accurate transition-based parser using a neural network for the oracle, which decides which of the several possible actions to take at each timestep. However, transition-based parsing, including this neural parser, tends to make myopic decisions precisely because it makes said decisions locally. In this paper, we attempt to address this using Gated Recurrent Units (GRUs) to give the oracle more global information for decision making.
Syntax Based Natural Language Models for Code Completion of Android API Calls
Abstract: Veselin Raychev, Martin Vechev, and Eran Yahav, in their paper “Code Completion with Statistical Language Models” proposed a natural language model of programs. Their model views each program as a series of sentences, with words corresponding to function calls. However, the model suffers some deficiencies. The model is unable to handle arbitrary loops, instead requiring a bound on the number of iterations. This eliminates many possible programs. Moreover, the method of constructing executions counts every occurrence of a function call in a loop once for every iteration, potentially skewing results towards occurrences common in loops. We address this by considering an alternate model, building the sentences from the parse-tree of the program in a manner that attempts to limit these concerns. We also expand the model to consider a MEMM for API call selection.
Program Correction as an Editing Problem in the Domain of MOOCS
Abstract: We attempt to tackle the problem of program correction as an editing problem in the domain of MOOCS. Given an incorrect student submission program P, find a minimal edit ed such that the correct program Q is obtained by applying the edit ed to P. We frame this problem as a sequence translation problem, where we translate the sequence P to the sequence ed, rather than trying to translate to a correct program Q directly.
Learning High-Level Planning from Questions
Abstract: People often ask questions when faced with difficult problems. Acquiring information is a fundamental problem-solving tool. In this paper, we explore using question asking as a tool for text-aware high-level planning agents. Our model jointly learns what questions to ask, when to ask them, how to interpret natural language answers, and how to perform high-level planning. We train this via a standard reinforcement learning approach and use plan success as a reward signal. We implement question answering, which simulates human responses, via a naive information retrieval system matching questions to sentences in a wiki. We directly build upon and extend the MineCraft high-level planner described by Branavan et al. [0]. The difference in our approach is that instead of using the entire wiki to learn about the environment before problem solving, we acquire the information live as part of our novel problem solving method. To assess the effectiveness of our approach, we compare our results to baselines with different question types and answer types, and to the previously established results of Branavan et al. [0]. [0] Branavan, S. R. K., et al. "Learning high-level planning from text." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, 2012.
Discriminating between canonical and non-canonical fan fiction
Abstract: Fan fiction consists of stories written by fans of some fictional universe involving its settings and characters. The goal of this project is to discriminate between canonical (compatible with the original storyline) and non-canonical (contradicting the original work) fan fiction. This is a binary classification task. This project works with Joanne Rowling's Harry Potter series and the corresponding fan fiction. The data was scraped from a fan fiction archive (http://www.harrypotterfanfiction.com/), chosen for its relatively structured publishing pattern and the detailed, structured description of every fan fiction piece. A baseline SVM on stemmed bigrams was used to establish a base performance for the problem. I have tried two variants of the training set: with and without the original chapters. Each entry in the training and test sets corresponds to a chapter of a fiction piece. Having two options for the training set helps to see whether performance on the fan fiction test set improves when the original chapters are added to the training set. In general, the SVM yields around 60-70% accuracy. The plan was to use an LSTM and see if this yields an improvement over the baseline. The next step would be to see whether, along with the prediction, we can obtain argumentation: a part of the text that accounts for the fan fiction piece being canonical or non-canonical. However, I have encountered difficulties with running the LSTM and am still working on it. I have also run GloVe on the original chapters and obtained vector representations for words, to analyze relationships between characters.
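A minimal sketch of the stated baseline (an SVM over stemmed bigrams), assuming chapters are plain strings labeled canonical (1) or non-canonical (0); the data and hyperparameters below are placeholders:

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def stem_tokens(text):
    return [stemmer.stem(tok) for tok in text.split()]

pipeline = make_pipeline(
    CountVectorizer(tokenizer=stem_tokens, ngram_range=(2, 2)),  # stemmed bigram counts
    LinearSVC(C=1.0),
)

chapters = ["Harry raised his wand and the spell struck true",
            "Hermione travelled to Mars on a dragon"]
labels = [1, 0]   # canonical vs. non-canonical
pipeline.fit(chapters, labels)
predictions = pipeline.predict(["Ron entered the common room quietly"])
```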
Unsupervised Discovery of Morphological Chains Using Data from Multiple Languages
Abstract: Building on the morphological analysis method proposed in “An Unsupervised Method for Uncovering Morphological Chains” (Narasimhan et al.), this project adapts the method to incorporate data from two languages. Words are modeled as morphological chains, starting with the base word and ending with the observed word, with each word in the chain derived from the word before it. The low-rank tensor model uses two languages and produces morphological chains for the main-language words by considering orthographic and semantic features of the words and their equivalents in the side language. This makes it possible to improve chain construction by capturing similarities between the morphological systems of the main and side languages.
Computationally Identifying Confusing Passages in Textbooks
Abstract: A common way to learn about new subjects is to read a textbook. However, writing clear text that will teach novices new concepts can be difficult, with learners often confused by the explanations and examples in the text. In this work, we explore different linguistic aspects of textbook passages that may lead to confusion. We present a case study of textbook passages used in a number of introductory physics classes that have been annotated by the students during the course of the class with their comments and questions. Focusing on the annotations that express confusion, we seek to distinguish the passages that lead to confusion among students. We find that features that consider the type of terminology used, when and how new terminology is introduced, and the subjectivity and specificity of the language are predictive of confusing instructional text, among others. Our best model using linguistic features outperforms unigram and random baselines by around 20%, achieving an average of around 73% accuracy on our textbook dataset.
Identifying Customer Needs from Amazon Reviews
Abstract: Understanding customer needs is essential for successful new product development. The traditional approach to identifying customer needs is interview-based and requires manual analysis, which is expensive. Nowadays, millions of customers describe their preferences for products in online reviews, but marketing managers rarely explore this online content, as it is noisy and often repetitive. In the current project, I identify a set of non-repetitive customer needs from Amazon reviews for a particular product category, and show how to organize them into a convenient structure.
Subversive Language Usage in Fanfiction
Abstract: Many fan scholars have remarked on the prevalence of slash fanfiction (fanfiction about same-sex relationships -- relationships that are in the vast majority of cases not present in the original source text, often referred to as canon). Some theorize that this prevalence is because writing slash is a way for fans to critique traditional concepts of masculinity or subvert and resist normative narratives of gender and sexuality. There are papers that discuss the role of narrative and character/relationship development in slash fanfiction, but it is still ambiguous whether slash fanfiction does something fundamentally different from "regular" fanfiction. In this project, I aim to understand whether use of language in slash fanfiction, especially as compared to fanfiction and canon, can be construed as subversive and transformative in the same way that narrative and character/relationship development have been established to be. To do so, I trained Google's open-source Word2Vec neural network on training data consisting of a billion words of news article text (to build a stable model of the English language); the text of all seven Harry Potter books (to build a baseline for vocabulary and concepts specific to or localized to the world of Harry Potter) and the text of a few hundred works of Harry Potter fanfiction. I labeled vocabulary in these works of fanfiction with the genre of fanfiction (slash or non-slash) such that Word2Vec would read "Harry" in slash fanfiction as distinct from "Harry" in non-slash fanfiction or "Harry" in the source text. This meant that I had separate trained embeddings for vocabulary in each genre, and I could analyze these embeddings to understand what differences in vocabulary usage emerged.
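A sketch of the genre-tagging idea, assuming a recent version of gensim: tokens from each corpus are prefixed with a genre marker so that Word2Vec learns separate vectors for, e.g., "Harry" in slash fanfiction, in non-slash fanfiction, and in canon. The corpora below are toy placeholders:

```python
from gensim.models import Word2Vec

def tag_tokens(sentences, genre):
    return [[f"{genre}::{tok.lower()}" for tok in sent] for sent in sentences]

canon    = [["Harry", "looked", "at", "Ron", "and", "laughed"]]
slash    = [["Harry", "held", "Draco's", "hand", "gently"]]
nonslash = [["Harry", "cast", "a", "shield", "charm"]]

corpus = tag_tokens(canon, "canon") + tag_tokens(slash, "slash") + tag_tokens(nonslash, "gen")
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# compare how the "same" word is used across genres
sim = model.wv.similarity("slash::harry", "canon::harry")
```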
Python Code Completion as a “N”LP Problem
Abstract: Many techniques in natural language processing rely on statistical methods because natural language is fuzzy and imprecise. In contrast, computer languages follow precise known rules, so tools for working with computer languages are usually strictly rule-based. However, even in the realm of computer languages, some analysis tasks may benefit from statistical methods by being able to capture regularities in how computer code is authored, such as in the usage of common APIs. In this project I look at the task of code completion for Python, a language that presents some challenges for static analysis in the traditional sense due to its dynamic nature. The task is as follows: given a corpus of existing Python code and a test program that includes blanks to be filled in, suggest a list of possible completions for each blank. I develop several approaches for this task, including a discriminative model for predicting attribute and method names using contextual features based on the abstract syntax tree of the Python program.
Non-Markovian Control Policies for Text-based Games using External Memory and Deep Reinforcement Learning
Abstract: Recent work by Narasimhan et al. (2015) employed a deep reinforcement learning framework to learn control policies for text-based games, where all interactions between the agent and the virtual world are through text. This previous framework is Markovian in that the agent selects its action based only on the current textual description. However, games in general do not always exhibit this Markov property, because relevant information from past descriptions could affect the current action. Our work attempts to use experience replay memory and memory networks to deal with the non-Markovian nature of these games. We compare the results of our architecture with the framework of Narasimhan et al. on custom games with quests that require memory from previous game states.
Linguistic Harbingers of Betrayal: A Case Study on an Online Strategy Game
Abstract: The paper by Thomas, Pang, and Lee [0] describes a method for determining whether a speech made on the U.S. Congress floor supports or opposes the proposed bill. In particular, the paper improves the results substantially by including agreements between different speeches. This project explores the effects of modifying how the inter-speech relations are accounted for in the model. [0]: http://www.cs.cornell.edu/home/llee/papers/tplconvote.dec06.pdf
Dependency Parsing of Low-Resource Languages
Abstract: Although state-of-the-art parsers on English reach an F1-score of above 90%, it is hard to reach such scores on less common languages. A major impediment to parsing for low-resource languages is the lack of availability of gold tree banks on which to train parsers. Previous work has explored transferring dependency trees between different languages using good quality parallel data; however, this method is not applicable to low-resource languages, since it requires large, good-quality parallel corpora. In this project, we explore techniques to project dependency arcs onto target-language sentences from comparable (or “almost parallel”) corpora: that is, documents which are about the same topic and which are likely to contain some similar sentences, but which are not exact translations of each other. The inspiration for this direction of work is that Wikipedia can serve as an abundant source of comparable corpora across many different languages, so if we can develop effective dependency tree projection techniques for comparable corpora, these techniques could be very applicable to low-resource languages. In a nutshell, we use dictionary data from Wiktionary to align sentences from resource-rich languages English and French to similar sentences in a (low-resource) target language, then “project” dependency arcs from the resource-rich sentences to the target-language sentences to create training data for a target-language parser. Our preliminary results achieve precision above 50% for Wikipedia data (and over 70% for parallel Europarl data). For certain types of dependencies (such as noun→adjective), we achieve precision around 80% for Wikipedia data (and up to 96% for Europarl data). However, we note that our techniques do not produce all dependency arcs, so accuracy is somewhat lower: around 40% for Wikipedia data, and around 60% for Europarl data. We believe that there is potential to refine these techniques to further improve performance.
Turning Experimental Procedures into Machine-Readable Recipes
Abstract: Large-scale manufacturing was made more efficient by automation and mechanized procedures centuries ago. Scientific research, however, still requires scientists or technicians to perform the experiments, due to their inherent variability and complexity. Battery research, in particular, involves a large amount of hands-on experimentation. In order to help automate this research, we attempt to create step-by-step machine-readable procedures from the experimental procedure sections of journal publications using common techniques in natural language processing. To achieve this goal, we built a pipeline with several steps. First, it takes in a list of raw sentences from the experimental procedure text and labels the words using our custom tags to denote whether they are (A)ctions, (I)ngredients, (E)quipment, (PR)oducts, (P)roperties, (R)eferences, or (N)one. It then groups words together based on these labels and separates out steps such that each step contains exactly one Action. Within each step, a shift-based dependency parser connects the tagged groups to each other. Finally, the pipeline orders all of the steps based on keywords in the procedure so that each step follows chronologically from the previous one. The result is output as a list of dictionaries, one per step, with details regarding what is needed to perform that step. In order to evaluate the performance of our model, we have built a small dataset to serve as training and development data.
Unsupervised flow graph construction from cooking recipe data
Abstract: Thousands of cooking recipes can be found on the web in the form of unstructured text. In this project we use this large unannotated corpus to build a model that transforms a recipe written for humans into a sequence of strict instructions (states) that can be executed by a robot. Each state defines which actions should be performed, in what order, on which objects, and with which instruments, taking into account the quantities of the ingredients used. We use semantic role labeling as an initial approximation to the states and then fine-tune them with common-sense hypotheses. We also train a probabilistic model to set up connections between different states, thereby constructing the flow graph with the highest likelihood.
Recurrent Neural Network Encoder with Attention for Community Question Answering
Abstract: We apply a general recurrent neural network (RNN) encoder framework to the community question answering (CQA) selection tasks. Our approach does not rely on any linguistic tools and can be applied to different languages or domains. Further improvements are observed after we extend the RNN encoders with a neural attention mechanism that encourages reasoning over the entire sequence. To deal with practical issues such as data sparsity and imbalanced labels, we also apply techniques such as transfer learning and multi-task learning. Our experiments on the SemEval-2016 CQA task show a 9% improvement in MAP compared to an information retrieval-based approach, and results comparable to a strong handcrafted feature-based method.
Duzen oder Siezen: Predicting the Formality of Second-Person Address in German
Abstract: Many non-English languages possess two different forms for formal and informal second-person pronouns, yet there is not much existing research in NLP characterizing this dichotomy. By using contextual features to predict the form of second-person address, many related problems such as machine translation, information extraction, and conversational AI could be improved. In this project, the dialogues between characters in German novels from Project Gutenberg are used as training data to inform models making use of bag-of-words and n-gram features to predict the formality of second-person pronouns in a test set of dialogues where these words have been erased. Such models were found to have significantly higher accuracy than random or mode baseline approaches.
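A minimal sketch of the classification setup described, assuming the second-person pronoun is masked out and a bag-of-words/n-gram model over the surrounding dialogue predicts formal vs. informal address; the training examples and settings below are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

contexts = [
    "<PRON> haben mir sehr geholfen , mein Herr .",   # context around a formal pronoun (Sie)
    "kommst <PRON> heute abend mit ?",                # context around an informal pronoun (du)
]
labels = ["formal", "informal"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram and bigram context features
    LogisticRegression(max_iter=1000),
)
model.fit(contexts, labels)
print(model.predict(["<PRON> sind sehr freundlich ."]))
```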
E2VScorer: An Embedding-Based Approach to Evaluation of Argument Strength in Student Essays
Abstract: While automatic essay grading remains a topic of interest in NLP, methods which seek to evaluate argument strength, instead of more formal dimensions, remain rare; furthermore, those described in the literature primarily rely on feature-based classifiers, whose heuristic rules can lose predictive value when applied across different essay sets. We propose a supervised, neural network-based approach, based on word2vec and one of its extensions, doc2vec, to construct a classifier over trained word, essay, and prompt embeddings; this classifier provides a means for scoring essays on the basis of the quality of their argument in answering the relevant prompt. Our method outperforms both a naive, probabilistic baseline and a feature-based approach which relies on heuristically applied sentence labels.
Learning Character Graphs from Literature
Abstract: Literary scholars, when comparing different works of literature, frequently find salient characters and their relationships to make comparisons. As this process is manual and requires close reading, it is then difficult to compare large sets of literature. In this paper, we describe a machine learning approach to building a system which automates this task. We describe a heuristic means of collecting labels from online character descriptions (such as those in Sparknotes), features and classifiers for labeling salient characters and binary relations, and methods for disambiguating characters output by classifiers. We report results on a dataset of 100 novels extracted from Project Gutenberg, giving precision and recall for both raw classifier output and disambiguated output.
Sifter, a New Machine Learning Application for Clustering Medical Research Findings
Abstract: The quantity of medical and scientific literature available to the average scientist is increasing at a rapid pace. However, there is currently no good method for easily extracting information from this multitude of data without extensive human interaction. As a result of this inability to easily sift through data, many important findings from cutting edge medical research go unnoticed by the rest of the scientific community. What is needed is a new tool to simplify the act of organizing medical research data based on clusters of findings and topics. This is what my project, Sifter, aims to do. Utilizing Amazon Web Services’s powerful backend, Sifter cross-references NIH’s PubMed Open Access dataset of 1,156,698 full-text XML medical research papers and 82,448 meta-research articles to automatically create training clusters of article topics without human interaction. Using this training set, Sifter is able to generate a neural network model that can also cluster new incoming articles in an online manner. The specifics of Sifter’s neural network accuracy will be revealed during the 6.806 final poster presentation for the class.
Assisting Coreference Resolution with Word Vectors
Abstract: Coreference resolution is the problem of linking pronouns like ‘he’ or ‘they’ to the subjects they represent. Some coreference problems can be solved by syntax (for example, ‘they’ would never refer to a single individual), but the most difficult coreference problems, such as those in the Winograd dataset [1], can only be solved using semantic knowledge of the world. Previous work attempting to solve Winograd-style coreference problems has leveraged various techniques including predicate-based rules, such as [if X is scared of Y, then often X is afraid of Z]. We explore an approach that utilizes the semantic information stored in word vectors to crack Winograd-style coreference problems. The word-vector approach on its own provides about 60% resolution accuracy, suggesting that word vectors may be a useful tool for coreference resolution. [1]: http://www.hlt.utdallas.edu/~vince/data/emnlp12/
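A toy illustration of one possible word-vector heuristic for Winograd-style examples (not necessarily the project's exact scoring rule): compare the embedding of the content word attached to the pronoun with each candidate antecedent and pick the closer one. The vectors below are tiny made-up stand-ins for real embeddings:

```python
import numpy as np

# e.g. "The trophy doesn't fit in the suitcase because it is too big."
embeddings = {
    "trophy":   np.array([0.9, 0.1, 0.30]),
    "suitcase": np.array([0.2, 0.8, 0.40]),
    "big":      np.array([0.8, 0.2, 0.35]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def resolve(pronoun_context_word, candidates):
    ctx = embeddings[pronoun_context_word]
    return max(candidates, key=lambda c: cosine(embeddings[c], ctx))

print(resolve("big", ["trophy", "suitcase"]))   # picks the semantically closer antecedent
```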
Sequence-to-Sequence Dependency Parsing
Abstract: We propose to build a dependency parser that uses an LSTM neural network to directly predict head-tail relationships from input words. This represents a simpler approach to parsing than shift-reduce parsing with a neural classifier, which is the state of the art in dependency parsing. Our approach is inspired by the attention-based architecture of neural translation models. With our architecture, we hope to achieve parsing accuracies that are competitive with those of shift-reduce parsers.
Predicting Chemical Reactivity with Auto-Encoded Embeddings and Recurrent Neural Networks
Abstract: Knowing the steps and input reagents needed to synthesize a chemical is essential for industrial and research purposes. However, chemical reactions (and their papers) can be difficult to search. Furthermore, chemists cannot easily predict the reactivity of two chemicals if the database lacks the exact reaction. We pose this problem as a natural-language processing task on chemical names. First, we auto-encoded embeddings based on chemical names, placing similar words closer together in the feature space. Then, we trained a recurrent neural network to predict whether two chemicals will react, using a database of known chemical reactions (ChemNet). We compared our model to a naive Bayes classifier, a maximum entropy / logistic regression classifier, and an SVM classifier trained on true reactions from the ChemNet database interspersed with negative examples. Our model has many possible extensions, including the construction of novel synthesis pathways.
Corpus-Based Question Answering with Neural Module Networks
Abstract: One critical but challenging problem in natural language processing is to consistently understand and correctly answer general questions about text. In this project, we tackle a subset of these questions, in this case middle-school-level multiple-choice science standardized-exam questions, using neural module networks (NMNs), recently used in machine vision to answer questions about pictures. We have extended some of the original modules (i.e., measure, combine, attend) created for NMNs so that they can process text corpora as well as images, and have created our own modules that are more specific to the corpora we are analyzing (e.g., a corpus parser that identifies the most relevant paragraphs to search through, in order to cut down on noise). Finally, in order to maximize the percentage of questions answered correctly, we also attempt to compose NMNs with feature detection methods. The referenced paper is at http://arxiv.org/pdf/1511.02799v2.pdf.
Neural Network Techniques for Producing Ingredient Representations in Vector Space
Abstract: There have been many significant advances in recent years in the natural language processing community in using neural-network-based language models to produce distributed representations of words, most notably word2vec. However, there has not been much research on applying these effective models to applications outside of NLP. We demonstrate an NLP approach to modeling the ingredients in food products by using a neural network model to learn vector representations for the ingredients. In addition, we use our neural network model to answer two questions: given a set of ingredients, 1) is this a valid combination of ingredients, and 2) if it is a valid combination, what food category does this set of ingredients belong to? Ingredient embeddings generated using the skip-gram model show high proximity between ingredients that have similar semantic meanings (e.g. baking soda and sodium bicarbonate, apple and strawberry). In addition, the model is able to correctly predict the validity of a set of ingredients with an accuracy of 92%. From 1124 food categories, the neural model is able to correctly predict the category more than 52% of the time. When the number of categories is reduced to 16, it is 77% accurate. Finally, we leverage a hierarchical database to map unseen ingredients to their vector representations.
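A sketch of the embedding step as described (skip-gram over ingredient lists), assuming gensim: each product's ingredient list plays the role of a sentence, so ingredients that co-occur in products end up close together in the vector space. The product lists here are toy placeholders:

```python
from gensim.models import Word2Vec

products = [
    ["flour", "sugar", "baking soda", "eggs", "butter"],
    ["flour", "sugar", "sodium bicarbonate", "milk"],
    ["apple", "strawberry", "sugar", "pectin"],
]

model = Word2Vec(products, vector_size=50, window=10, min_count=1, sg=1)  # sg=1 -> skip-gram
similar = model.wv.most_similar("sugar", topn=3)   # nearby ingredients in embedding space
```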
Predicting the Effectiveness of Cardiac Resynchronization Therapy Using Natural Language Processing
Abstract: Cardiac Resynchronization Therapy (CRT) has become a standard therapy for a subset of patients suffering from Heart Failure (HF). While CRT is successful in the majority of cases, a significant minority of patients express a neutral or negative response to the therapy without well-understood cause. Since much of the relevant information in clinical data is in narrative text form, current analyses are limited to small information-rich datasets where researchers can read clinical notes on each patient or large information-poor datasets where only structured information on each patient is analyzed. In this paper, we use state-of-the-art natural language processing (NLP) techniques to analyze a large dataset of CRT patients. We find that including the free-text information through the use of NLP techniques not only improved performance over current clinical guidelines, but also allowed our model to discover latent clinical variables of the problem. We hope these results will motivate further research into previously unknown predictors of successful CRT outcomes and demonstrate the benefits of using NLP models in clinical settings.
Change My View!
Abstract: We analyzed comments on the internet discussion forum at reddit.com/r/changemyview, a location where people can submit an opinion and ask for rebuttals or opposing viewpoints. Selected viewpoints can be awarded ‘deltas,’ signifying that they successfully changed one’s opinion. We built a crawler to generate our own dataset, since this space did not have an existing corpus for use. We then built multiple classifiers that use information both about the comments themselves as well as their context in the entire comment tree to attempt to predict what posts would be awarded ‘deltas’. We achieved a precision of 12% and a recall of 10% on our test set of approximately 40,000 comments. The numbers reflect the difficulty of the task, because a variety of difficult-to-quantify features factor into the ability of a post to be convincing; personal opinions, long multi-comment discussions, and so on.
Giving Machines Memory and Focus: Evaluating Attention-Based and Memory-Based Neural Network Models on Large Q&A Datasets
Abstract: One of the most basic tasks in natural language processing is question answering. Recent advancements in deep neural networks have shown promise in extracting information over long sequences of data as well as integrating a semantic knowledge base. In this work, I look at recently introduced neural network models including the Memory Network and Dynamic Memory Network. These models seek to apply a semantic memory as well as an attention mechanism to answering questions about articles. While they have seen success on smaller corpora (such as the bAbI tasks), it remains to be seen whether they can be successful on larger datasets. I evaluate the models on such datasets (Wiki QA, Google CNN) and observe an improvement over baseline LSTM models. The improvement is especially pronounced on the largest datasets.
Determining API Usage Policies from Method Documentation
Abstract: The public APIs of various software packages contain constraints on what constitutes a valid use of each API method, often documented in both explicit (e.g., “may not be null”) and subtle (e.g., “a duration in milliseconds” implying a non-negative integer) manners. In this work, we attempt to build a system that can learn constraints on method parameters from their documentation. Notably, our approach requires no annotated data, instead learning correct usage automatically by performing static analysis on a corpus of API usages (that is, calls to API methods). We demonstrate our technique on the problem of detecting when a parameter is expected to be constant. The extracted policies can be used in automated checkers to detect bugs in programs that use the API.
Unsupervised Learning of Hierarchical Representations for Recipe Text with Deep Belief Networks
Abstract: We explore the application of deep belief networks (DBNs) as language models for the instruction text of cooking recipes. We provide quantitative evidence that DBNs can outperform simpler, non-hierarchical language models, and qualitative evidence that they can succeed at learning reasonable hierarchical representations. We also present preliminary results focusing on extracting and improving the hierarchical representations learned by our model, with a focus on encouraging sparsity in the hidden layers.
NILE: An Interactive Natural Language Interface for Relational Databases
Abstract: There is an increasing push towards making databases more accessible. A natural language interface for databases (NLIDB) is considered the ultimate goal for a database query interface. Existing approaches are tailored to work in specific circumstances. In this report, I describe NILE, a prototype NLI for relational databases that makes no assumptions about the database schema. NILE uses an off-the-shelf parser to construct a dependency parse tree, constructs questions to ask the user about domain-specific terminology, uses a classifier to take local actions on the parse tree so that it comes to resemble SQL, and in a final pass generates a SQL query from it. By these means, a logically complex English-language sentence is translated into a SQL query, which may include joins and nesting. The system is evaluated on two benchmark database domains [results pending].
TextBooster: Substituting Desired Vocabulary into Daily News Content
Abstract: Studying for standardized exams (e.g. SAT, GRE) requires regular practice with advanced vocabulary, yet it is difficult to find time for deliberate vocabulary practice. Although people read web articles on a daily basis, it is uncommon to encounter the desired set of difficult vocabulary words in context, because their occurrence in daily language is rare. We introduce the idea of modifying regularly encountered sentences during web browsing by substituting desired vocabulary words into existing sentences without substantially modifying the original meaning. We first explore the practical feasibility of this approach, and find that approximately 10% of SAT words have close synonyms that appear frequently in news corpora. Furthermore, among the top 7 news stories from the front page of the New York Times (on December 6, 2015), 23 sentences could reasonably be substituted with an SAT word. After training classifiers using language model features, we present classification results on sentences that were heuristically created through the removal of target words from existing sentences. Finally, we report on performance with real sentences, using human annotations as ground truth. These results provide insight into how daily content could be actively transformed for the purpose of encouraging productivity.
Semi-supervised Compositional Vector Parsing with a Skip-gram Model
Abstract: The goal of this project is to explore a method for training a compositional vector parser in a semi-supervised setting. Our method includes an additional objective to guide and regularize the model by encouraging phrase vectors at each level of the compositional model to be able to predict their context. Each constituent vector is used in a Skip-gram model to predict the words surrounding the phrase. This objective is optimized over a large set of trees labeled by a PCFG parser - the same parser which is used by the compositional vector parser to generate candidates for re-ranking. While we hope this method will be able to eventually boost parsing performance, at the time of this writing, our implementation is unable to increase accuracy.
Application Of Information Extraction To Automate Research On Chinese Corruption
Abstract: This project is an attempt to extract structured information from Chinese news articles reporting corruption cases, as an input for subsequent research on Chinese corruption. In particular, for each article, we are interested in knowing the culprit, the crime, the punishment, the culprit's job, and the time and location of the crime. We had hand-annotated data for 400 news articles from the last 20 years, thanks to Prof. Karen Zheng from the Sloan School of Management. Our strategy is to first extract the relevant information from the segmented articles, i.e., named entities such as person names, locations, times, and amounts of money, as well as the sentences relevant to crime descriptions and punishments. Using commercial software, Boson NLP, we were able to extract the person names and locations. We then created feature vectors to match crimes, times, and job positions. With the extracted information, we were already able to reach reasonable precision/recall scores of around 20%. The second step is to match the information to the correct culprit, i.e., to match the culprit with his own crime, punishment, money involved, position, location, and time. Our baseline matches them with a simplified distance function. We filter out positions and patterns which indicate that the corresponding person is more likely to be a positive figure and thus unrelated to the crime; this increases precision by eliminating the reporters, police officers, and judges that appear in the same context. An attempt was made to use syntactic parsing to help with name-crime correspondence, using the Language Platform Cloud (LTP) software. While this is in general a wider and more open topic in NLP, we came up with 2 algorithms: 1) using juxtaposition patterns recognized from syntactic parsing, we deal with the situation where multiple culprits or objects are mentioned together in one sentence; 2) using Subject-Verb-Adverb-Object relations recognized from syntactic parsing, we improve the baseline distance calculation and better match the correct person to his criminal actions.
Feature-rich event detection in social media streams
Abstract: Social networking platforms such as Twitter have emerged in recent years, creating a radically new mode of communication between people. Monitoring and analysing the rich and continuous flow of user-generated content can yield information of unprecedented value, which would not have been available from traditional media outlets. However, learning from Twitter streams poses new challenges compared to traditional media. Traditional approaches to information extraction from microblog streams involve clustering based on semantic features of tweets. In this project, we implement and study different clustering models based on certain Twitter-specific features, including geopositional data, timestamps, hashtags and check-ins. We also perform clustering based on the bursty behaviour of n-grams in Twitter documents. The resulting clusters contain valuable information about live events, but they also contain a lot of noise. We treat this as a binary classification problem and use machine learning algorithms to classify and rank the clusters based on their textual, social and temporal features. Finally, we summarise the top-ranking clusters and output the results to the end user.
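A small sketch of one burstiness heuristic consistent with the description above: an n-gram's count in the current time window is compared with its smoothed historical rate, and n-grams whose ratio exceeds a threshold are flagged as bursty seeds for event clusters. Thresholds and counts are placeholders:

```python
from collections import Counter

def bursty_ngrams(current_window, history_windows, min_count=5, ratio=3.0):
    history = Counter()
    for window in history_windows:
        history.update(window)
    n_hist = max(len(history_windows), 1)
    bursty = {}
    for gram, count in current_window.items():
        expected = history[gram] / n_hist + 1.0          # smoothed historical rate
        if count >= min_count and count / expected >= ratio:
            bursty[gram] = count / expected
    return bursty

now = Counter({("earthquake",): 40, ("good", "morning"): 12})
past = [Counter({("good", "morning"): 10}), Counter({("good", "morning"): 11})]
print(bursty_ngrams(now, past))   # only the suddenly frequent n-gram is flagged
```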
How Does Content Drive Viewership?
Abstract: We are interested in exploring the relationship between content and viewership. Using our unique dataset from the New York Times (NYT), we are able to accurately determine the number of views an article on the NYT website receives. We scrape the NYT website for the content of each article and use various NLP techniques to construct textual content features such as article perplexity, sentiment, and reading difficulty. We also try to develop a robust set of control features by using the internal meta-data and the non-textual components of each article. We then use all our features to build a predictive regression model.
Morphological Word Segmentation using Semi-supervised Learning and Cross-language Analysis
Abstract: Building on an existing algorithm to perform morphological analysis, based on using semantic features and contrastive estimation to detect morphological chains, we investigate the effects of various types of extensions to this algorithm. We began by adding certain features to the model and heuristics for prefix/suffix selection. In addition, we extend the unsupervised model with the log-likelihood of known word segmentations, producing a semi-supervised model. Finally, we also attempt to use Turkish morphological parents and an English-Turkish translation dictionary to help detect irregular English word segmentations and transformations.
Decision-Making Text Summarization on Cochrane (Medical Database)
Abstract: In the medical community, conclusively determining whether a treatment or medicine is effective for certain symptoms requires careful review of all related published papers and verification that the experiments are statistically significant. In this project, we design an NLP algorithm that automatically produces one of three decisions, “YES”, “NO”, or “MAYBE”, for a given topic based on the related articles, and we train and evaluate the accuracy of the model on data obtained from the Cochrane community. Cochrane is a medical database with summaries for topics such as “Honey for child cough” and “Adjuvant Therapy for completely resected Stage II Colon Cancer”. The database is considered a credible and accurate source, since all summaries are produced by humans who analyze the related articles and papers. Since Cochrane summaries are well formatted and the referenced articles are also credible, the noise in the training data is relatively small. However, in some cases, even though all referenced articles support the claim (meaning the decision should be “YES”), the Cochrane summary suggests “MAYBE” or “NO” because some referenced experiments are not statistically significant. Our NLP algorithm does not deal with such tricky cases, which require statistical analysis; therefore, we filter those topics out of our training and test sets. This filtering is done by a classifier and may itself introduce errors.
A Method for Relation Extraction Using Named Entity Vectors
Abstract: Relation extraction is a problem in natural language processing that is integral to discovering structured knowledge in unstructured data such as news corpora. We tackle the task of discovering lists of semantically related named entity pairs from the text of news articles. We first present a naive solution to serve as a benchmark. We demonstrate that applying simple linear transformations in the original vector space to all combinatorial pairs of entities from an article produces poor results. We identify the following two main challenges: a) properly filtering out pairs of unrelated named entities and b) finding vector representations for a single entity pair that we can use for comparison with other entity pairs. To reduce the number of noisy entity pairs, we use dependency parsing to extract named entity pairs that are syntactically linked as a heuristic for semantically linked entity pairs. To discover a better vector representation for a named entity pair, we employ an autoencoder neural network model. We present the results of these combined methods against the benchmark.
Guess My Mood
Abstract: Natural language sentences, especially in fictional narratives, often convey emotions that can be understood by human readers. However, unlike in speech, written text does not have pitch or prosodic cues that signal emotions. In order to extract the emotional elements of a sentence, a more representative set of features and models is required. In this paper, we explore the text-based emotion prediction problem by investigating the effectiveness of different feature sets and existing models. Finally, we construct and evaluate a new two-pass computation model based on neural networks. Our feature extraction method uses an emotion lexicon and a dependency tree to form a sentence feature vector that is fed into a neural network model to extract the sentence emotion in the first pass. The predicted emotions are then fed into a Long Short-Term Memory neural network (LSTM) for reevaluation. We base our experiments on a fairy tale corpus containing 176 annotated stories with 15,302 sentences in total. The proposed feature set, when used to train a naive Bayes model, yields a higher average F1 than our baseline. However, the two-pass neural network model does not improve the results, potentially due to a shortage of training data and unbalanced emotion classes.
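A simplified sketch of the first-pass feature step described above, assuming a word-to-emotion lexicon: a sentence is represented by length-normalized counts per emotion category (the dependency-tree features used in the actual model are omitted here, and the lexicon is a toy stub):

```python
import numpy as np

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise"]
lexicon = {"furious": "anger", "afraid": "fear", "happy": "joy", "wept": "sadness"}

def emotion_features(tokens):
    counts = np.zeros(len(EMOTIONS))
    for tok in tokens:
        emo = lexicon.get(tok.lower())
        if emo is not None:
            counts[EMOTIONS.index(emo)] += 1
    return counts / max(len(tokens), 1)   # length-normalized emotion feature vector

features = emotion_features("The princess wept and was afraid".split())
```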
Reinforcement Learning for Transition-Based Parsing
Abstract: We evaluate the feasibility of a reinforcement learning (RL) approach for transition-based parsing. In transition-based parsing, the parser modifies a sentence configuration (stack, buffer, arcs) in discrete steps until termination. At each step, the parser must take a well-defined action (left-arc, right-arc, shift). In our RL framework, the parser learns its policy, a mapping from each sentence configuration to a probability distribution over the next action, by responding to an environmental reward. We first build a toy parser whose policy is characterized as a log-linear model with relatively few (∼1800) features. We find that even with a local reward function, the toy parser can be trained with RL to achieve ∼63.6% unlabeled attachment score (UAS) on the Universal Dependencies test set for English. We further show that by incorporating a global reward function, the toy parser achieves ∼72.1% UAS. This boost produced by the global reward is promising – it suggests that the ability of RL to train long-term decision making might be useful. Finally, we attempt to extend our RL framework to Chen & Manning's (2014) neural parser, which was originally trained on a per-arc (local) basis.
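A compact sketch of the toy setup described: a log-linear (softmax) policy over the three arc-standard actions, updated with a REINFORCE-style rule from a scalar reward. The feature extraction and the parser environment are stubbed out, and all names and sizes are illustrative:

```python
import numpy as np

ACTIONS = ["shift", "left-arc", "right-arc"]
rng = np.random.default_rng(0)
theta = np.zeros((len(ACTIONS), 1800))        # one weight vector per action over ~1800 features

def policy(features):
    scores = theta @ features
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                    # softmax over the next action

def reinforce_update(features, action_idx, reward, lr=0.1):
    probs = policy(features)
    grad = -np.outer(probs, features)         # expected-feature term of grad log pi(a|s)
    grad[action_idx] += features              # observed-feature term
    theta[:] = theta + lr * reward * grad     # gradient ascent on expected reward

feats = rng.random(1800)                      # stand-in for configuration features
action = rng.choice(len(ACTIONS), p=policy(feats))
reinforce_update(feats, action, reward=+1.0)
```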
Looking for Gendered Differences in Instructor Evaluation Language
Abstract: Instructors can receive feedback on their course content and teaching methods from students who fill out subject evaluation forms. However, some studies have indicated there may be gender bias in course evaluations: male and female professors seem to be rated differently. We are interested in seeing if we can predict the gender of the professor being evaluated using learned features from student text responses. Our data comes from an MIT sorority’s collection of informal student evaluations for humanities classes. We trained various predictive models such as logistic regression, SVMs, and topic models using text features of the student responses. Currently, we have been unable to train a predictive model that performs better than our baseline of guessing the most common gender.
Recipe Scoring with a Recurrent Neural Network Sequence-to-Sequence Model
Abstract: Adapting a sequence of instructions is a common task in many application domains. For instance, in the field of chemistry, it is often necessary to modify a series of chemical reactions in order to obtain a variety of desired end products. If the application space is large, the modification process may become intractable; however, in many cases, a large subset of the possible modifications are unsuitable. The purpose of this project is to devise a scoring system that will automatically find the space of acceptable modifications to a sequence of instructions, thereby eliminating the need to search the entire space exhaustively. For illustrative purposes, the scoring system will be trained and tested on instructions taken from pasta recipes. The system consists of a sequence-to-sequence translation model implemented using a recurrent neural network (RNN) based upon a gated recurrent unit (GRU) architecture. The model is trained by applying individual sentences to the RNN and requiring that the network replicate these training sentences. A test recipe is then given to the network, along with a number of modified versions, some of which are sensible, and some of which are not sensible. The system then gives a score to each modification of the original recipe. Experiments reveal that in the vast majority of cases, the system correctly identifies suitable substitutions in the original recipe.
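A condensed sketch of the scoring idea, assuming a GRU encoder-decoder trained to reproduce recipe sentences: a candidate modification is scored by the per-token log-likelihood the model assigns when asked to reconstruct it (higher means more recipe-like). Sizes and names are illustrative, and the start-token shift normally used in decoding is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecipeAutoencoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def score(self, tokens):
        # tokens: (1, T) word ids for one candidate sentence
        emb = self.embed(tokens)
        _, h = self.encoder(emb)                       # sentence summary
        dec_out, _ = self.decoder(emb, h)              # teacher-forced reconstruction
        logp = F.log_softmax(self.out(dec_out), dim=-1)
        token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        return token_logp.mean().item()                # average per-token log-likelihood

model = RecipeAutoencoder(vocab_size=2000)
candidate = torch.randint(0, 2000, (1, 9))             # e.g. a modified recipe sentence
print(model.score(candidate))
```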
Identifying and Combining Ingredients from Natural Instructions
Abstract: From naturally written instructions with steps but no explicit list of required components, we extract a set of components, or ingredients, that are used in the instructions. These ingredients consist of the materials necessary to complete the instructions given. We use a neural network to identify the ingredients. The training data is based on a list of food ingredients, which was used to identify and tag food recipes in our dataset. The neural network is compared against a set of baselines. In addition, we attempt to extract how to combine the ingredients together from the instructions.
Reimplementing Neural Tensor Networks for Knowledge Base Completion (KBC) in the TensorFlow framework
Abstract: Reasoning with Neural Tensor Networks for Knowledge Base Completion has become something of a seminal paper in the short span of two years, cited by nearly every knowledge base completion (KBC) paper since its publication in 2013. It was one of the first major successful forays into deep learning approaches to knowledge base completion, and was unique for using deep learning "end to end". TensorFlow is a tensor-oriented numerical computation library recently released by Google. It represents algorithms as directed acyclic graphs (DAGs), with nodes as operations and edges as schemas for tensors. It has a robust Python API and GPU bindings. We reimplemented Socher's algorithm in the TensorFlow framework, achieving similar accuracy results with an elegant implementation in a modern language.
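A NumPy sketch of the Neural Tensor Network scoring function from the Socher et al. (2013) paper, g(e1, R, e2) = u_R^T tanh(e1^T W_R^[1:k] e2 + V_R [e1; e2] + b_R), with one parameter set per relation R. Only the forward computation is shown, and the dimensions are illustrative:

```python
import numpy as np

d, k = 100, 4                     # entity dimension, number of tensor slices
rng = np.random.default_rng(0)
W = rng.normal(size=(k, d, d))    # bilinear tensor, one d x d slice per output unit
V = rng.normal(size=(k, 2 * d))   # standard-layer weights over [e1; e2]
b = rng.normal(size=k)
u = rng.normal(size=k)

def ntn_score(e1, e2):
    bilinear = np.einsum("i,kij,j->k", e1, W, e2)      # e1^T W[slice] e2 for each slice
    standard = V @ np.concatenate([e1, e2]) + b
    return float(u @ np.tanh(bilinear + standard))     # scalar plausibility of the triple

e1, e2 = rng.normal(size=d), rng.normal(size=d)        # stand-ins for learned entity vectors
print(ntn_score(e1, e2))
```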
Generation of Action Graphs for Chemical Synthesis Procedures
Abstract: There exists a significant bottleneck in deploying task plans directly from natural language instructions. Often additional effort is required to create a formatted plan representation. There is a growing interest in reducing the translation burden by automatically generating action graphs, which summarize the text into a sequence of ordered actions. In this project, we apply several supervised methods to generate action graphs for the domain of chemical synthesis procedures. We contribute to the subproblems of identifying key action verbs, extracting relevant arguments (i.e., inputs and outputs of actions), and determining the global sequence.
Parsing with Neural Scoring and Randomized Greedy Inference
Abstract: In this project, we take a structured prediction approach to parsing, scoring dependency arcs with simple feedforward networks and using the randomized greedy inference algorithm of Zhang and Lei (2014) to find the best-scoring parse tree. The network weights are trained in an end-to-end max-margin fashion to separate the scores of trees in accordance with how far they deviate from the correct trees. We compare results when using 1) log-linear rather than neural scoring, 2) arc-by-arc rather than end-to-end training, 3) first-order vs. higher-order features, and 4) weaker vs. stronger inference.
Information Extraction for Articles on Mass Shooting
Abstract: We are interested in the problem of information extraction across multiple documents related to a specific topic. In particular, we build a system that uses a search engine to find relevant documents and extracts information from them. In this project, we focus on events about mass shootings. We begin with a set of events and a small set of relevant articles. Our task is to infer facts about an event from these articles: specifically, the name of the shooter, the number of people killed and wounded, and the location (city) of the shooting. To predict facts from a single article, we use a Maximum Entropy Markov Model (MEMM) to tag individual words in the documents, which are news articles from the web. Using these tags, we average over words with particular tags to predict the event information. We also designed a novel way to improve the accuracy of the prediction: instead of relying on results from a single article, we expand the set of tagged articles and combine the results. A search query is constructed from information in the original article and used to find other articles describing the same event online; the Google and Bing APIs are used to automate the search. To evaluate our results, we report precision, recall, and F1 scores for the name of the shooter, the number of people killed and wounded, and the location (city) of the shooting.
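A minimal sketch of the MEMM tagging component, with an assumed simplified feature set: a logistic-regression (maximum entropy) classifier conditioned on the previous tag, decoded greedily left to right. The cross-article aggregation step is not shown, and the feature and tag names are illustrative.

```python
# Minimal MEMM sketch: P(tag_i | tag_{i-1}, word features), trained as
# logistic regression and decoded greedily.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def memm_features(words, i, prev_tag):
    w = words[i]
    return {"word": w.lower(), "is_cap": w[:1].isupper(),
            "is_digit": w.isdigit(), "prev_tag": prev_tag}

def train_memm(tagged_sentences):
    """tagged_sentences: list of (words, tags) pairs; gold previous tags at train time."""
    X, y = [], []
    for words, tags in tagged_sentences:
        for i in range(len(words)):
            prev = tags[i - 1] if i > 0 else "<START>"
            X.append(memm_features(words, i, prev))
            y.append(tags[i])
    return make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000)).fit(X, y)

def greedy_decode(model, words):
    tags, prev = [], "<START>"
    for i in range(len(words)):
        prev = model.predict([memm_features(words, i, prev)])[0]
        tags.append(prev)
    return tags
```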
Implementation and examination of an end-to-end trainable memory network as a language model.
Abstract: I implement an end-to-end memory network based on Sukhbaatar et al.'s very recent paper, 'End-To-End Memory Networks' (arXiv:1503.08895), which presents a novel framework for implementing and training memory networks in an end-to-end fashion. The key focus of that paper is applying the memory network model to question answering tasks; however, the authors also touch upon the use of memory networks as language models, an exciting application due to memory networks' ability to generate predictions based on an indefinitely large prior context. I present a TensorFlow implementation of an end-to-end memory network as a word-level language model, as presented in Sukhbaatar et al.'s paper. I also examine the effects of various parameters on the model, including the number of memory hops and the optimization method used. Additionally, I investigate the viability of a character-level language model based on an end-to-end memory network. Due to limited training time and the personal funds available for purchasing server time, I am unable to fully replicate the experiments and results of Sukhbaatar et al.; however, I present valuable insights into the use of memory networks as language models.
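The core computation of the model, one memory "hop" as described by Sukhbaatar et al., can be sketched in a few lines of numpy: the query attends over memory slots with a softmax, reads out a weighted sum, and adds it back to the query. The embedding of context into the memory matrices m and c, and any per-hop linear maps, are assumed to happen elsewhere.

```python
# Sketch of one memory hop from an end-to-end memory network (Sukhbaatar et al.).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(u, m, c):
    """u: (d,) query; m: (n, d) input memories; c: (n, d) output memories."""
    p = softmax(m @ u)        # attention over the n memory slots
    o = p @ c                 # weighted read-out
    return u + o              # next-hop query (optionally passed through a linear map)

# Stacking hops, e.g. three as in the language-modeling experiments:
# for _ in range(3): u = memory_hop(u, m, c)
```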
An NLP-based Approach to Improving Prediction of Twitter Follower Count
Abstract: The Twitter corpus presents a variety of very interesting NLP and machine learning problems due to its size (over 200 million active users) as well as the lack of formal syntactic structure, spelling, and grammar that follows from the 140-character limit per tweet. Traditional machine learning approaches therefore have difficulty dealing with the Twitter corpus, and common NLP approaches must be modified because extracting feature vectors from tweets is difficult. Aside from their brevity and lack of formal language structure, tweets also include hashtags, mentions, and links, all of which may add noise to the tweet data but, used intelligently, may also aid the learning task. Our goal is to combine ML and NLP techniques in order to predict the number of followers for a Twitter user based on a given tweet. The number of followers is a valuable metric because it is an easily observable characteristic of a given Twitter account and is useful in determining a user's influence in the Twitter social network. We show that augmenting traditional machine learning approaches with NLP techniques such as word vector embeddings and token tagging can improve the accuracy of predicting Twitter follower count.
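One plausible instantiation of the embedding-based features (an illustration, not necessarily the project's exact pipeline) is to average pretrained word vectors over a cleaned tweet and feed the result to a regressor. The word_vectors lookup, the cleaning rules, and the choice of Ridge regression on log counts are all assumptions.

```python
# Illustrative feature pipeline: averaged word embeddings per tweet -> regressor.
# Assumes a pretrained `word_vectors` dict mapping token -> numpy vector.
import re
import numpy as np
from sklearn.linear_model import Ridge

def tweet_vector(tweet, word_vectors, dim=100):
    """Average word embeddings, dropping links, mentions, and hashtag markers."""
    tokens = re.sub(r"https?://\S+|@\w+", " ", tweet).replace("#", " ").lower().split()
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_follower_regressor(tweets, follower_counts, word_vectors):
    X = np.stack([tweet_vector(t, word_vectors) for t in tweets])
    y = np.log1p(follower_counts)          # counts are heavy-tailed, so predict log counts
    return Ridge().fit(X, y)
```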
RegexRNN: Generating Regular Expressions from Natural Language with Deep Recurrent Neural Nets
Abstract: This work presents a neural architecture for learning to translate natural language into regular expressions. We use a dataset [1] of 824 natural language and regular expression pairs and train a deep recurrent neural net to generate regular expressions given natural language prompts. The model is trained end-to-end with little to no feature engineering. It learns embeddings of regex characters and natural language words and computes thought vectors for each sequence with stacked LSTM RNNs. We evaluate the model by comparing the generated regular expression exactly to the reference answer, achieving 58.2% accuracy, nearly on par with state-of-the-art results (65.5%). These results are promising because our model is very data hungry yet is trained on a small dataset, and the current state of the art [1] relies on extensive feature engineering and regex DFA transformations that we do not perform. [1]: Kushman, Nate; Barzilay, Regina. 'Using Semantic Unification to Generate Regular Expressions from Natural Language.' North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
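A compact sketch of the described encoder-decoder, written here in PyTorch with hypothetical names rather than the project's code: a stacked-LSTM encoder over natural-language tokens whose final states serve as the thought vector, and a stacked-LSTM decoder over regex characters trained with teacher forcing.

```python
# Illustrative stacked-LSTM seq2seq for natural language -> regex characters.
import torch
import torch.nn as nn

class RegexSeq2Seq(nn.Module):            # hypothetical name
    def __init__(self, nl_vocab, regex_vocab, emb=128, hid=256, layers=2):
        super().__init__()
        self.nl_embed = nn.Embedding(nl_vocab, emb)
        self.rx_embed = nn.Embedding(regex_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hid, regex_vocab)

    def forward(self, nl_tokens, regex_chars):
        _, state = self.encoder(self.nl_embed(nl_tokens))    # final states = "thought vector"
        dec_out, _ = self.decoder(self.rx_embed(regex_chars[:, :-1]), state)
        return self.out(dec_out)           # logits for regex_chars[:, 1:]
```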
Automated Argument Evaluation
Abstract: Our research project extends the work of the paper "Modeling Argument Strength in Student Essays" by Persing and Ng. Using tools provided by NLTK, spaCy, and scikit-learn, we train a classifier to score essays written by high schoolers based on how coherent the main argument is. In addition to features described in the original paper, such as POS n-grams and coreference features, we introduce features that capture sentence-level relationships, such as deduction and cause-effect. We also compare the accuracies of different classifiers, including SVMs, maxent models, and neural networks.
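As an illustration of the POS n-gram portion of the feature set only (coreference, discourse, and the new sentence-level relationship features are omitted), the sketch below converts each essay to its POS-tag sequence with NLTK and trains a scikit-learn classifier over tag n-grams; the linear SVM shown is just one of the classifiers compared.

```python
# Sketch of POS n-gram features for essay scoring (simplified feature set).
# Requires nltk's tokenizer and POS-tagger models to be downloaded.
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def pos_sequence(essay):
    """Replace each token by its POS tag so n-grams are over tags, not words."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(essay))]
    return " ".join(tags)

def train_argument_scorer(essays, scores):
    X = [pos_sequence(e) for e in essays]
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 3)),   # POS uni/bi/trigrams
        LinearSVC(),
    )
    return model.fit(X, scores)
```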
Recipe Modification based on User Comments
Abstract: On online recipe websites, a significant portion of user-generated comments include modifications to the recipe. We propose a method to modify a recipe based on these comments. The method consists of two parts: 1) a neural language model that scores a comment segment on whether it is likely a refinement, and 2) an RNN that computes a distribution over recipe segments indicating how likely a given refinement is to target each segment. Both parts assume that refinement wording is more similar to recipe wording than to the background text of reviews. Using this assumption, we can frame part 2 as a supervised learning problem: input refinements come from removing segments from the recipes themselves, potentially modifying these artificially generated refinements, and attempting to place them back where they belong in the recipe. By framing the task as two supervised learning problems, we hope to surpass the performance of prior, unsupervised approaches.
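The artificial data construction described above can be sketched directly: remove one segment from a recipe, optionally perturb it, and use its original position as the supervision signal for where it belongs. The function below is illustrative; the segmentation and perturbation steps are assumptions.

```python
# Sketch of the artificial training-data construction for part 2.
import random

def make_training_example(recipe_segments, perturb=None):
    """recipe_segments: list of strings. Returns (context, refinement, gold_index)."""
    idx = random.randrange(len(recipe_segments))
    refinement = recipe_segments[idx]
    if perturb is not None:                  # optional paraphrase / noise function
        refinement = perturb(refinement)
    context = recipe_segments[:idx] + recipe_segments[idx + 1:]
    return context, refinement, idx

segments = ["boil water", "add salt", "cook pasta 9 minutes", "drain"]
print(make_training_example(segments))
```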