Data Acquisition

From MT Talks
Revision as of 16:17, 23 February 2015

There seems to be a universal rule for statistical methods in NLP (and beyond): more data is better data.

Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime.[1][2][3]

== Available Sources of Large Data ==

=== Monolingual ===

Google n-grams, Google Books n-grams, Common Crawl + Moses n-grams
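These resources are, at heart, frequency tables of short word sequences. A minimal sketch of how such counts can be produced from tokenized text (the helper name `ngrams` is illustrative, not taken from any of the toolkits above):

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Toy corpus; resources like the Google n-grams are built the same way,
# just over billions of tokens.
tokens = "the quick brown fox jumps over the lazy dog".split()
trigram_counts = Counter(ngrams(tokens, 3))
print(trigram_counts.most_common(3))
```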

=== Parallel ===

OPUS

EU

Token vs. type: each running occurrence of a word in a corpus is a token; each distinct word form is a type.
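The distinction matters when measuring corpus size versus vocabulary size; a quick illustration:

```python
sentence = "to be or not to be"
tokens = sentence.split()   # tokens: every running occurrence of a word
types = set(tokens)         # types: distinct word forms

print(len(tokens))  # 6 tokens
print(len(types))   # 4 types: 'to', 'be', 'or', 'not'
```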

== Obtaining More Data ==

=== Crowdsourcing ===

=== Social Media ===

== Zipf's Law ==
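Zipf's law says that the frequency of the r-th most frequent word is roughly proportional to 1/r, so rank times frequency stays roughly constant. A sketch of the rank-frequency computation (toy corpus, so the fit here is only suggestive; on a large corpus the products level out):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent first."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

tokens = "the cat sat on the mat and the dog sat on the rug".split()
for rank, word, freq in rank_frequency(tokens):
    # Under Zipf's law, rank * freq would hover around a constant.
    print(rank, word, freq, rank * freq)
```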

== References ==

  1. Philipp Koehn. Inaugural lecture.
  2. A. Halevy, P. Norvig, F. Pereira. ''The Unreasonable Effectiveness of Data''.
  3. Jan Hajič, Eva Hajičová. [http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1 ''Some of Our Best Friends Are Statisticians''].