ISBN Syntactic Dependency Parser

A syntactic parser achieving state-of-the-art results on many languages. Latent variables are used to induce features appropriate for the language and grammar formalism. Results are available for the CoNLL 2007 shared task (entry: "Titov et al.") and for the CoNLL 2009 shared task, syntactic scores (entry: "Merlo").

Download parser and documentation

Relevant publications:

Unsupervised Semantic Role Labeling

Our non-parametric Bayesian model for semantic role induction achieves state-of-the-art results and is applicable in a multilingual setup.

Download code, data and evaluation scripts

Relevant publications:

Bayesian Unsupervised Semantic Parsing

Our Bayesian model for Unsupervised Semantic Parsing induces frame-semantic representations on top of syntactic dependency trees in an unsupervised way, by distributionally clustering fragments of those trees.

Download code, data and evaluation scripts

To run the QA experiments we described in the paper, you will need to obtain the preprocessed version of the Genia corpus and the evaluation scripts from Hoifong Poon; follow this link.

Relevant publications:

Crosslingual Distributed Word Representations

Crosslingual representations of words (for several language pairs) can be downloaded here. To replicate our experiments you need the NIST RCV2 data. Please obtain a free license from NIST, and then we can provide you with the processed data, evaluation scripts, and the code.

A relevant publication:

Aspect-based sentiment summarization

In this work, we consider joint induction of sentiment and aspects (e.g., location of a hotel vs. cleanliness of rooms for a hotel review) for fragments (sentences or phrases) of text. Interestingly, it is useful to model the flow of discussion in a text (i.e., elementary discourse relations) at the same time as inducing the sentiment and aspect information.

The data can be downloaded here.

Relevant publication:

Semantic script data

In this work, we learn both the notion of events / frames and how these events are organized into higher-level activities (e.g., going to a restaurant involves entering a restaurant, waiting to be seated, etc.).

In some of our experiments we used the NY Times Corpus; this dataset is available via LDC, but we can help you reproduce our setting. We do provide the crowdsourced data. The dataset includes both development (dev) and test data. Each of these two directories contains different script scenarios located in their respective sub-directories; e.g., the script scenario for preparing coffee can be found in the sub-directory test/coffee. Please consult the readme for further information. This dataset is a preprocessed version of a corpus created within the SMILE project at Saarland University. Part of this dataset comes from the MIT OMICS corpus.
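As a convenience, the directory layout described above (dev and test directories, each with one sub-directory per script scenario) could be enumerated with a short Python sketch like the one below. The function name list_scenarios and the root path passed to it are hypothetical; the format of the files inside each scenario directory is not assumed here, so consult the readme before parsing them.

```python
from pathlib import Path


def list_scenarios(root):
    """Map each split ('dev', 'test') to the sorted names of its scenario sub-directories.

    `root` is the directory where the dataset was unpacked (hypothetical path;
    the archive's actual top-level name may differ).
    """
    root = Path(root)
    scenarios = {}
    for split in ("dev", "test"):
        split_dir = root / split
        if not split_dir.is_dir():
            # Split directory missing: report it as empty rather than failing.
            scenarios[split] = []
            continue
        # Only sub-directories count as scenarios; plain files (e.g. a readme) are skipped.
        scenarios[split] = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    return scenarios
```

For example, calling list_scenarios on the unpacked dataset directory should list coffee among the test scenarios, since the coffee scenario is located in test/coffee.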

Relevant publications:

Generation of captions for videos

Please refer to the web page of the relevant project at the Max Planck Institute for Informatics.

The relevant paper: