Data Acquisition


There seems to be a universal rule for statistical methods in NLP (and not only for them): more data is better data.

Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime [1].

See [2] and [3] for further discussion of the role of large data and statistical methods in NLP.

Available Sources of Large Data

Monolingual

Google n-grams, Google Books n-grams, n-gram counts from Common Crawl (released for use with Moses)
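
A minimal sketch of how such n-gram counts are extracted from raw monolingual text (the corpus file name is a placeholder, not a real dataset path):

  # Count n-grams in a whitespace-tokenized corpus.
  from collections import Counter

  def ngram_counts(tokens, n):
      """Count all n-grams (tuples of tokens) in a token sequence."""
      return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

  with open("corpus.txt", encoding="utf-8") as f:
      tokens = f.read().split()  # naive whitespace tokenization

  for ngram, count in ngram_counts(tokens, 3).most_common(10):
      print(" ".join(ngram), count)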

Parallel

OPUS (a large collection of freely available parallel corpora)

EU institutions (e.g., the Europarl corpus of European Parliament proceedings)

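A minimal sketch of reading a sentence-aligned parallel corpus in the common plain-text format (one sentence per line, two line-aligned files, as offered e.g. by OPUS; the file names are placeholders):

  # Read line-aligned source/target files as sentence pairs.
  from itertools import islice

  def read_parallel(src_path, tgt_path):
      """Yield (source, target) sentence pairs from two line-aligned files."""
      with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
          for s, t in zip(src, tgt):
              yield s.strip(), t.strip()

  # Print the first five sentence pairs.
  for src_sent, tgt_sent in islice(read_parallel("corpus.cs", "corpus.en"), 5):
      print(src_sent, "|||", tgt_sent)
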
Token vs. type: a token is one occurrence of a word in running text, a type is a distinct word form. Corpus sizes are usually reported in tokens, vocabulary sizes in types.
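
A minimal sketch of the distinction in code (again with a placeholder corpus file):

  # Tokens = running words; types = distinct word forms.
  with open("corpus.txt", encoding="utf-8") as f:
      tokens = f.read().split()

  types = set(tokens)
  print("tokens:", len(tokens))    # corpus size
  print("types:", len(types))     # vocabulary size
  print("type/token ratio:", len(types) / len(tokens))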

Obtaining More Data

Crowdsourcing: having many non-professional contributors translate or correct sentences, typically via platforms such as Amazon Mechanical Turk.

Social media: a large and constantly growing source of informal (and often noisy) text.

Zipf's law: the frequency of a word is roughly inversely proportional to its rank in the frequency table, f(r) ∝ 1/r. For data acquisition this means a long tail: however much text we collect, most types remain rare and new types keep appearing.
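
A minimal sketch of checking this empirically: if f(r) is proportional to 1/r, then rank times frequency should stay roughly constant across ranks (the corpus file is again a placeholder):

  # Inspect the rank-frequency distribution of a corpus.
  from collections import Counter

  with open("corpus.txt", encoding="utf-8") as f:
      counts = Counter(f.read().split())

  frequencies = sorted(counts.values(), reverse=True)
  for rank in (1, 10, 100, 1000):
      if rank <= len(frequencies):
          f_r = frequencies[rank - 1]
          print(f"rank {rank:5d}  freq {f_r:8d}  rank*freq {rank * f_r}")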

References

  1. Philipp Koehn. Inaugural lecture.
  2. Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
  3. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians.