Data Acquisition

From MT Talks
Lecture 6: Data Acquisition
Lecture video: web TODO

A seemingly universal rule applies to statistical (and other) methods in NLP: more data is better data.[1][2]

Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime.[3]

Available Sources of Large Data

This list is far from exhaustive; it covers only a few of the more interesting sources.


Google has released n-gram counts computed from web text and from Google Books.

Common Crawl is an initiative that builds an open repository of crawled web data. The Moses n-grams are similar to the Google n-grams but were computed on the Common Crawl. They are not pruned, so the data set is much larger but can offer more detailed statistics.
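To make the effect of pruning concrete, here is a minimal sketch of n-gram counting with an optional count threshold. The function name `count_ngrams` and the `min_count` parameter are hypothetical and chosen only for illustration; this is not the pipeline used to build either data set.

```python
from collections import Counter

def count_ngrams(tokens, n, min_count=1):
    """Count n-grams in a token list; drop those seen fewer than min_count times.

    min_count=1 keeps everything (no pruning). Higher values shrink the
    table but discard statistics about rare n-grams.
    """
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

toy_tokens = "a b a b c".split()
unpruned = count_ngrams(toy_tokens, 2)             # all bigrams
pruned = count_ngrams(toy_tokens, 2, min_count=2)  # only repeated bigrams
```

On this toy input the unpruned table holds three bigrams, while pruning at `min_count=2` keeps only `("a", "b")`.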


Politics has been the motivation for many parallel corpora. The Canadian Hansard is published in both French and English and is one of the best-known parallel corpora. EU regulations are published in all official languages of the EU, providing an invaluable language resource (if nothing else).

OPUS is a repository which contains most of the publicly available parallel corpora. The data is available for download in several formats, cleaned and processed with a unified pipeline.

Obtaining More Data

More data, or more data of a specific kind, can be obtained through crowdsourcing, among other means.

People also create large amounts of data every day and a good part of this is published via social media. It is therefore not surprising that some research in NLP focuses on leveraging these interesting new data sources.

Zipf's Law

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
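The relationship can be sketched numerically. The snippet below assumes an idealized Zipf distribution over a finite vocabulary, where the word of rank r has relative frequency proportional to 1/r; the function name `zipf_frequencies` and the vocabulary size of 1000 are illustrative assumptions, and real corpora follow the law only approximately.

```python
def zipf_frequencies(num_types, exponent=1.0):
    """Relative frequencies for ranks 1..num_types under an idealized Zipf law."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_types + 1)]
    total = sum(weights)  # normalize so the frequencies sum to 1
    return [w / total for w in weights]

freqs = zipf_frequencies(1000)
ratio_1_2 = freqs[0] / freqs[1]  # most frequent vs. second most frequent
ratio_1_3 = freqs[0] / freqs[2]  # most frequent vs. third most frequent
```

Under this idealization, `ratio_1_2` is 2 and `ratio_1_3` is 3, matching the statement above.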

Implications of this law can be observed everywhere in NLP. While just the few dozen most frequent words (types) cover about half of all tokens (word occurrences) in a natural language corpus, the tail of infrequent words is extremely long.
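One way to check this on a corpus of your own is to measure what share of tokens the k most frequent types account for. The following is a minimal sketch; the name `coverage_of_top_types` and the toy sentence are assumptions for illustration, and a corpus this small will not reproduce the proportions seen in real data.

```python
from collections import Counter

def coverage_of_top_types(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
top1 = coverage_of_top_types(toy_tokens, 1)  # share of tokens that are "the"
top3 = coverage_of_top_types(toy_tokens, 3)  # share covered by the top three types
```

Here the single type "the" covers 4 of 13 tokens, and the top three types cover 8 of 13.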

Moreover, even if we collect many times more data than we have now, we will not cover much more of the tail, and many infrequent words will remain out-of-vocabulary (OOV) for our NLP system.
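The OOV rate is easy to measure: it is the share of test tokens that never occurred in the training data. The sketch below uses a hypothetical helper `oov_rate` and toy sentences; it only illustrates the definition and the fact that more data of the same kind does not necessarily reduce the rate.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose type was never seen in training."""
    if not test_tokens:
        return 0.0
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)

train_small = "the cat sat on the mat".split()
train_large = train_small * 10  # ten times more data, but no new types
test_tokens = "the dog sat on the rug".split()

rate_small = oov_rate(train_small, test_tokens)
rate_large = oov_rate(train_large, test_tokens)
```

Both rates are 1/3 here, because "dog" and "rug" stay unseen no matter how often the same training sentence is repeated.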


  1. Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data.
  2. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians.
  3. Philipp Koehn. Inaugural lecture.