Data Acquisition
One rule seems to hold almost universally in NLP, and not only for statistical methods: more data is better data.
Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime[1].
Available Sources of Large Data
Monolingual
Google n-grams, Google book n-grams, Common Crawl + Moses n-grams
Parallel
OPUS
EU
token vs. type
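When corpus sizes are reported, tokens are the running words (every occurrence counts), while types are the distinct word forms. The sketch below illustrates the difference on a toy sentence. It assumes lowercasing and whitespace splitting as the tokenization, which is a simplification; the function name `token_type_counts` and the sample sentence are made up for illustration.

```python
def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are lowercased, whitespace-separated strings.
    Real tokenizers also handle punctuation, clitics, etc.
    """
    tokens = text.lower().split()   # every running word is a token
    types = set(tokens)             # each distinct word form is a type
    return len(tokens), len(types)


sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)  # 11 tokens, 8 types
```

Here "the" contributes three tokens but only one type, and "sat" contributes two tokens but one type.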
Obtaining More Data
Crowdsourcing
Social Media
Zipf's Law
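Zipf's law states that the frequency of a word is roughly inversely proportional to its rank in the frequency list, so the product of rank and frequency stays roughly constant. A minimal sketch of how one might inspect this on a tokenized corpus follows. The helper `rank_frequency` and the toy token list are hypothetical, and a toy input cannot demonstrate the law itself; the pattern only shows up on large corpora.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return rows of (rank, word, frequency, rank * frequency).

    Words are ranked by descending frequency; under Zipf's law the
    last column would be roughly constant on a large corpus.
    """
    counts = Counter(tokens)
    table = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        table.append((rank, word, freq, rank * freq))
    return table


# Toy input, for illustration only (assumed whitespace tokenization).
toy_tokens = "a a a a b b c".split()
for row in rank_frequency(toy_tokens):
    print(row)
```

One consequence relevant to data acquisition is that most types are rare, so new data keeps contributing previously unseen words.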
References
- [1] Philipp Koehn. Inaugural lecture.
- [2] A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data.
- [3] Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians.