Data Acquisition

From MT Talks
Revision as of 17:12, 23 February 2015 by Tamchyna (talk | contribs)

There seems to be a universal rule for (not only) statistical methods in NLP: more data is better data. [1][2]

Translation systems have at their disposal (orders of magnitude) more training data than a person reads in a lifetime.[3]

Available Sources of Large Data

This is definitely not a list of all possible sources, just a few of the interesting ones.

Monolingual

Google released n-grams of the whole web and of Google Books.

Common Crawl is an initiative which builds an open repository of crawled web data. Moses n-grams are similar to Google n-grams but were computed on Common Crawl data. There is no pruning, so the data is much larger but can offer more detailed statistics.
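To make the notion of n-gram counts concrete, here is a minimal sketch of how such counts are gathered from tokenized text. This is only an illustration of the data structure, not the actual pipeline used to build the Google or Common Crawl releases (those operate at web scale and apply their own tokenization):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (a tuple of n consecutive tokens) in a sequence."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
bigrams = ngram_counts(tokens, 2)
# e.g. the bigram ("the", "cat") occurs once; 5 bigrams in total
```

Pruning, mentioned above, simply means dropping n-grams whose count falls below a threshold; skipping it keeps rare events (hence larger data, finer statistics).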

Parallel

Politics has been the motivation behind many parallel corpora. The Canadian Hansard, the proceedings of the Canadian Parliament, is published in both French and English and is one of the best-known parallel corpora. EU regulations are published in all official European languages, providing an invaluable language resource (if nothing else).

OPUS is a repository which contains most of the publicly available parallel corpora. The data is available for download in several formats, cleaned and processed with a unified pipeline.

Token vs. Type
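The distinction can be shown in a few lines: tokens are running words (counted with repetition), while types are distinct word forms. The simple whitespace tokenization below is an assumption for illustration only:

```python
def token_type_stats(text):
    """Tokens: running words, counted with repetition.
    Types: distinct word forms (the vocabulary)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    types = set(tokens)
    return len(tokens), len(types)

n_tokens, n_types = token_type_stats("the cat sat on the mat")
# 6 tokens but only 5 types, since "the" occurs twice
```

As a corpus grows, the token count grows linearly while the type count grows much more slowly, which is one reason more data keeps helping: rare types keep appearing.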

Obtaining More Data

Crowdsourcing

Social Media

Zipf's Law
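Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank: the second most common word occurs about half as often as the first, the third about a third as often, and so on. A small sketch of the rank-frequency view (the toy data here is made up purely to show the computation):

```python
from collections import Counter

def rank_frequencies(tokens):
    """Word frequencies sorted from most to least common (rank 1, 2, ...)."""
    return [count for _, count in Counter(tokens).most_common()]

freqs = rank_frequencies("a a a a b b c d".split())
# freqs == [4, 2, 1, 1]; under Zipf's law, rank * frequency
# stays roughly constant across ranks
zipf_products = [rank * f for rank, f in enumerate(freqs, start=1)]
```

The practical consequence for MT is a long tail: however large the corpus, a substantial share of types occur only once or twice, so more data always brings new words.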

References

  1. A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data
  2. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians
  3. Philipp Koehn. Inaugural lecture.