Data Acquisition

From MT Talks

Revision as of 16:48, 23 February 2015

A near-universal rule seems to hold for statistical methods in NLP, and not only for them: more data is better data.[1][2]

Translation systems have orders of magnitude more training data at their disposal than a person reads in a lifetime.[3]

Available Sources of Large Data

Monolingual

Google n-grams, Google book n-grams, Common Crawl + Moses n-grams

Parallel

OPUS

EU

token vs. type
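
The token/type distinction can be made concrete with a small count: tokens are all running words in a text, while types are the distinct word forms among them. The following is only a minimal sketch, assuming lowercasing and whitespace tokenization (real corpora need a proper tokenizer); the function name `token_type_counts` and the sample sentence are made up for illustration.

```python
def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    tokens = text.lower().split()   # every running word counts as a token
    types = set(tokens)             # each distinct word form counts once
    return len(tokens), len(types)


sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types = token_type_counts(sample)
print(f"tokens: {n_tokens}, types: {n_types}")  # tokens: 11, types: 8
```

Corpus sizes are usually reported in tokens, while vocabulary sizes are reported in types.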

Obtaining More Data

Crowdsourcing

Social Media

Zipf's Law
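
Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so the product rank × frequency stays roughly constant. The sketch below ranks words by count and prints that product. The helper `rank_frequency` is a hypothetical name, and the toy token list is constructed by hand to follow the law exactly; it is not real corpus data, where the relation holds only approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Toy data built so that count = 12 / rank (an idealized Zipfian distribution).
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3

for rank, word, count, product in rank_frequency(toy_tokens):
    print(rank, word, count, product)
```

On this toy input the last column is 12 in every row. A practical consequence for data acquisition is that most types are rare, so even large corpora contain many words seen only a handful of times.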

References

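  1. A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4804817&tag=1
  2. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians. http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1
  3. Philipp Koehn. Inaugural lecture. https://www.youtube.com/watch?v=6UVgFjJeFGY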