Data Acquisition
Revision as of 16:17, 23 February 2015
A seemingly universal rule for statistical methods in NLP (and beyond) is: more data is better data.
Translation systems have orders of magnitude more training data at their disposal than a person reads in a lifetime.[1]
Available Sources of Large Data
Monolingual
Google n-grams, Google Books n-grams, Common Crawl + Moses n-grams
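These monolingual resources are typically distributed not as raw text but as n-gram counts. A minimal sketch of how such counts are produced from a tokenized corpus (the whitespace tokenization here is a simplification, not how the released datasets were built):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "more data is better data".split()
bigrams = ngram_counts(tokens, 2)
# 4 bigrams in total, each occurring once
```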
Parallel
OPUS
EU
token vs. type
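Corpus size can be reported in tokens (running words) or types (distinct word forms); large corpora contain far more tokens than types. A small illustration of the distinction (naive tokenization, for illustration only):

```python
def token_type_counts(text):
    """Return (number of tokens, number of types) for a text,
    using naive lowercasing + whitespace tokenization."""
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))

n_tokens, n_types = token_type_counts("More data is better data")
# 5 tokens but only 4 types: "data" occurs twice
```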
Obtaining More Data
Crowdsourcing
Social Media
Zipf's Law
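Zipf's law states that the frequency of the r-th most frequent word is roughly proportional to 1/r, so frequency times rank is approximately constant; this is one reason ever-larger corpora keep contributing new (rare) types. A sketch of checking the law on a toy corpus with an exactly Zipfian distribution:

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """(rank, word, freq, freq * rank) rows for the most frequent words.
    Under Zipf's law the product freq * rank stays roughly constant."""
    counts = Counter(tokens).most_common(top)
    return [(r, w, f, f * r) for r, (w, f) in enumerate(counts, start=1)]

toy_corpus = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
table = zipf_table(toy_corpus)
# freq * rank is 6 in every row -- constant, as Zipf predicts
```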
References
1. Philipp Koehn. Inaugural lecture.
2. A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data.
3. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians. http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1