Data Acquisition: Difference between revisions

Revision as of 16:43, 23 February 2015

There seems to be a universal rule for statistical (and not only statistical) methods in NLP: more data is better data.^[1]^[2]

Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime.^[3]

Google n-grams, Google book n-grams, Common Crawl + Moses n-grams

OPUS

EU

token vs. type
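
The token vs. type distinction matters when comparing the sizes of corpora such as those listed above: tokens are the running words of a text, while types are the distinct word forms among them. A minimal sketch of the difference, assuming naive lowercasing and whitespace tokenization (real corpora would need a proper tokenizer; the function name `token_type_counts` is hypothetical and not taken from this page):

```python
from collections import Counter


def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    tokens = text.lower().split()   # every running word is a token
    types = Counter(tokens)         # each distinct word form is a type
    return len(tokens), len(types)


if __name__ == "__main__":
    n_tokens, n_types = token_type_counts("the cat saw the other cat")
    print(n_tokens, n_types)  # prints "6 4"
```

Here "the" and "cat" each occur twice, so the six tokens collapse to four types. As a corpus grows, the token count typically grows much faster than the type count.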

@@ Line 1: / Line 1: @@
-There seems to be a universal rule for (not only) statistical methods in NLP: '''More data is better data.'''
+There seems to be a universal rule for (not only) statistical methods in NLP: '''more data is better data.''' <ref name=effectiveness>A. Halevy, P. Norvig, F. Pereira. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4804817&tag=1 ''The Unreasonable Effectiveness of Data'']</ref><ref name=friends>Jan Hajič, Eva Hajičová. [http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1 ''Some of Our Best Friends Are Statisticians'']</ref>
 Translation systems have at their disposal (orders of magnitude) more training data than a person reads in a lifetime<ref name="inaug">Philipp Koehn. [https://www.youtube.com/watch?v=6UVgFjJeFGY Inaugural lecture.]</ref>.
-<ref name=effectiveness>A. Halevy, P. Norvig, F. Pereira. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4804817&tag=1 ''The Unreasonable Effectiveness of Data'']</ref>
-<ref name=friends>Jan Hajič, Eva Hajičová. [http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1 ''Some of Our Best Friends Are Statisticians'']</ref>
 == Available Sources of Large Data ==