Data Acquisition
Revision as of 16:43, 23 February 2015
A general rule seems to hold for statistical methods in NLP, and not only for them: more data is better data.[1][2]
Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime.[3]
Available Sources of Large Data
Monolingual
Google n-grams, Google book n-grams, Common Crawl + Moses n-grams
Parallel
OPUS
EU
token vs. type
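The token/type distinction is easiest to see on a tiny example: tokens are running word occurrences, types are distinct word forms. The sketch below is only an illustration, not part of the original notes; the helper name `token_type_counts`, the lowercasing, and the whitespace tokenization are all assumptions made for brevity.

```python
def token_type_counts(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are lowercased, whitespace-separated strings.
    Real corpora would need a proper tokenizer.
    """
    tokens = text.lower().split()
    # Tokens count every occurrence; types count each distinct form once.
    return len(tokens), len(set(tokens))


n_tokens, n_types = token_type_counts("the cat sat on the mat")
print(n_tokens, n_types)  # 6 tokens, 5 types ("the" occurs twice)
```

Corpus sizes are usually reported in tokens, while vocabulary size is measured in types, so the two numbers can differ by orders of magnitude on large data.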
Obtaining More Data
Crowdsourcing
Social Media
Zipf's Law
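Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank, so the product rank × frequency stays approximately constant across the vocabulary. A minimal sketch for inspecting this on a tokenized corpus follows; the function name `rank_frequency` and the toy input are illustrative assumptions, and a corpus this small cannot actually demonstrate the law.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        # Under Zipf's law, rank * count is roughly constant on large corpora.
        rows.append((rank, word, count, rank * count))
    return rows


# Toy input, used only to show the shape of the output.
for row in rank_frequency("a a a b b c".split()):
    print(row)
```

On real data the long tail of rare words is what makes additional data valuable: many types are seen only once or not at all in a small corpus.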
References
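1. A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4804817&tag=1
2. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians. http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1
3. Philipp Koehn. Inaugural lecture. https://www.youtube.com/watch?v=6UVgFjJeFGY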