Data Acquisition


There seems to be a universal rule for statistical methods in NLP (and not only for them): more data is better data.

Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime [1].

See [2] and [3] for further discussion of the role of large data and statistical methods in NLP.

Available Sources of Large Data

Monolingual

Google n-grams, Google Books n-grams, n-gram counts from Common Crawl (released for use with Moses)
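
A minimal sketch of how such n-gram counts are extracted from raw monolingual text (the corpus file name is a placeholder, not a real dataset path):

  # Count n-grams in a whitespace-tokenized corpus.
  from collections import Counter

  def ngram_counts(tokens, n):
      """Count all n-grams (tuples of tokens) in a token sequence."""
      return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

  with open("corpus.txt", encoding="utf-8") as f:
      tokens = f.read().split()  # naive whitespace tokenization

  for ngram, count in ngram_counts(tokens, 3).most_common(10):
      print(" ".join(ngram), count)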

Parallel

OPUS (a large collection of freely available parallel corpora)

EU institutions (e.g., the Europarl corpus of European Parliament proceedings)

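A minimal sketch of reading a sentence-aligned parallel corpus in the common plain-text format (one sentence per line, two line-aligned files, as offered e.g. by OPUS; the file names are placeholders):

  # Read line-aligned source/target files as sentence pairs.
  from itertools import islice

  def read_parallel(src_path, tgt_path):
      """Yield (source, target) sentence pairs from two line-aligned files."""
      with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
          for s, t in zip(src, tgt):
              yield s.strip(), t.strip()

  # Print the first five sentence pairs.
  for src_sent, tgt_sent in islice(read_parallel("corpus.cs", "corpus.en"), 5):
      print(src_sent, "|||", tgt_sent)
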
Token vs. type: a token is one occurrence of a word in running text, a type is a distinct word form. Corpus sizes are usually reported in tokens, vocabulary sizes in types.
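
A minimal sketch of the distinction in code (again with a placeholder corpus file):

  # Tokens = running words; types = distinct word forms.
  with open("corpus.txt", encoding="utf-8") as f:
      tokens = f.read().split()

  types = set(tokens)
  print("tokens:", len(tokens))    # corpus size
  print("types:", len(types))     # vocabulary size
  print("type/token ratio:", len(types) / len(tokens))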

Obtaining More Data

Crowdsourcing: having many non-professional contributors translate or correct sentences, typically via platforms such as Amazon Mechanical Turk.

Social media: a large and constantly growing source of informal (and often noisy) text.

Zipf's law: the frequency of a word is roughly inversely proportional to its rank in the frequency table, f(r) ∝ 1/r. For data acquisition this means a long tail: however much text we collect, most types remain rare and new types keep appearing.
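
A minimal sketch of checking this empirically: if f(r) is proportional to 1/r, then rank times frequency should stay roughly constant across ranks (the corpus file is again a placeholder):

  # Inspect the rank-frequency distribution of a corpus.
  from collections import Counter

  with open("corpus.txt", encoding="utf-8") as f:
      counts = Counter(f.read().split())

  frequencies = sorted(counts.values(), reverse=True)
  for rank in (1, 10, 100, 1000):
      if rank <= len(frequencies):
          f_r = frequencies[rank - 1]
          print(f"rank {rank:5d}  freq {f_r:8d}  rank*freq {rank * f_r}")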

References

  1. Philipp Koehn. Inaugural lecture.
  2. Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
  3. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians.