Semi-supervised and Unsupervised Learning with Applications to Natural Language Processing


Instructor: Ivan Titov
Time: Fri, 12:15 - 13:45
Location: Building C 7.2, room 2.11 (?)
Office hours: send me a message by e-mail.
Note: First class is on April 15 (please come)
Note: Second class: May 6

Short Description

Statistical models are used to address an increasing variety of problems, including problems where complex structured representations need to be predicted (e.g., natural language syntactic or semantic parsing, scene recognition in vision, or protein structure prediction in bioinformatics).

However, most of these statistical methods rely on annotated data. Obtaining such labelled data for every task and every domain (e.g., a language or even a genre: consider language use in newswire vs. user reviews) is very expensive (or impossible). These observations suggest that we should look into methods which exploit unlabelled data (e.g., texts available on the Internet), either to induce a model on its own or to improve a model learnt from a small amount of labelled data. The topic of this seminar is exactly this: exploiting unlabelled data to induce or improve statistical models.

The class will cover machine learning methods for semi-supervised and unsupervised learning. The main focus will be on problems from natural language processing, but most of the considered methods have applications in other domains (e.g., bioinformatics, vision, information retrieval). We will concentrate mostly on structured prediction problems (e.g., predicting a parse tree), as they are widespread, more challenging, and arguably more interesting.
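To make the semi-supervised setting concrete, here is a minimal sketch of self-training, one classic method in this family: fit a model on the labelled data, label the unlabelled points the model is most confident about, add them to the training set, and repeat. The 1-D data and the nearest-centroid classifier below are illustrative toys, not a method from the reading list.

```python
# Self-training sketch: iteratively move the most confidently
# predicted unlabelled points into the labelled set.
# Toy setup: 1-D features, nearest-centroid classifier.

def centroids(labelled):
    """Mean feature value per class (a trivial 1-D classifier)."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return (predicted label, confidence = negative distance)."""
    label = min(model, key=lambda y: abs(x - model[y]))
    return label, -abs(x - model[label])

def self_train(labelled, unlabelled, rounds=5, per_round=2):
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(rounds):
        if not unlabelled:
            break
        model = centroids(labelled)
        # Rank unlabelled points by the model's confidence.
        scored = sorted(((predict(model, x), x) for x in unlabelled),
                        key=lambda t: t[0][1], reverse=True)
        # Promote the most confident predictions to labelled data.
        for (label, _), x in scored[:per_round]:
            labelled.append((x, label))
            unlabelled.remove(x)
    return centroids(labelled)

# Two classes clustered around 0.0 and 1.0; only two points labelled.
labelled = [(0.1, "a"), (0.9, "b")]
unlabelled = [0.0, 0.2, 0.3, 0.8, 1.0, 1.1]
model = self_train(labelled, unlabelled)
print(predict(model, 0.15)[0])  # → a
```

The same loop applies unchanged when the classifier is replaced by a structured predictor; the hard part, which the seminar papers address, is keeping confidently wrong self-labels from reinforcing the model's errors.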

Though most of the papers considered will draw their applications from the NLP domain, I do not require any prior exposure to NLP (though it would be a plus). Ideally, you should have some prior experience with machine learning, statistical NLP, or IR. If hesitant, feel free to contact me and ask.

The class will focus on classic methods for semi-supervised and unsupervised learning, but will also consider some more recent (and often influential) techniques. We will also look at some interesting applications, such as semantic parsing of natural language and unsupervised grammar induction.


Requirements


Grading


Attendance policy

You can skip ONE class without giving me any explanation (provided it is not the class at which you are presenting). If you need to skip more, you will have to write an additional critical review for every paper presented while you were absent.


Presentation


Critical reviews


Term paper

Goal:

Grading criteria:

Length: 12 - 15 pages

Deadline: to be announced. I would recommend submitting the paper soon after your presentation, as it will probably be easier while the material is still fresh.

Submit it as a PDF by email.


Topics (some changes possible)

Note: References to papers, dates, and speakers will be provided in a Google Docs document (a link will be sent to attendees; in order to finalize the list, I need to know the number of participants and their preferences).