Data Acquisition
There seems to be a universal rule for statistical methods in NLP, and not only for them: more data is better data.
Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime[1].
token vs. type
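When measuring how much data a corpus holds, it helps to keep the two counts apart: tokens are the running words of a text, while types are the distinct word forms among them. The following is only a minimal sketch of that distinction. The function name `token_type_counts`, the sample sentence, and the naive lowercase-and-split tokenization are assumptions made for illustration, not part of the original notes.

```python
from collections import Counter
from typing import Tuple


def token_type_counts(text: str) -> Tuple[int, int]:
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. Real tokenizers also handle punctuation, clitics, etc.
    """
    tokens = text.lower().split()   # every running word is one token
    types = Counter(tokens)         # each distinct word form is one type
    return len(tokens), len(types)


# Hypothetical sample sentence, chosen only to show repeated words.
sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)  # prints "11 8"
```

Here "the" occurs three times and "sat" twice, so the sentence has 11 tokens but only 8 types. In larger corpora the token count typically grows much faster than the type count.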
References
- Philipp Koehn. Inaugural lecture.
- A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data.
- Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians.