Data Acquisition

There seems to be a universal rule for statistical methods in NLP (and not only for them): more data is better data. [1][2]

Translation systems have at their disposal (orders of magnitude) more training data than a person reads in a lifetime.[3]

Available Sources of Large Data

This is definitely not a list of all possible sources, just a few of the interesting ones.

Monolingual

Google released n-grams of the whole web (http://googleresearch.blogspot.cz/2006/08/all-our-n-gram-are-belong-to-you.html) and of Google Books (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html).

Common Crawl (http://commoncrawl.org/) is an initiative which builds an open repository of the crawled web. Moses n-grams (http://www.statmt.org/ngrams/) are similar to Google n-grams but have been computed on the Common Crawl. There is no pruning, so the data are much larger but can offer more detailed statistics.
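
To make the kind of statistic these collections provide concrete, here is a minimal sketch in plain Python; the file name and the pruning threshold are illustrative and not part of any dataset above. Setting min_count above 1 corresponds to the pruning that released n-gram collections often apply to keep their size down.

from collections import Counter

def count_ngrams(path, n=3, min_count=1):
    # Count whitespace-tokenized n-grams in a plain-text file.
    # min_count=1 keeps every n-gram (no pruning, as with the Moses n-grams);
    # a higher threshold drops rare n-grams, which pruned collections do
    # to keep their size manageable.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_count})

# Illustrative usage with a hypothetical corpus file:
# print(count_ngrams("corpus.txt", n=3, min_count=2).most_common(10))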

Parallel

Politics has been the motivation for a lot of parallel corpora. The Canadian Hansard is published in both French and English and is one of the best-known parallel corpora. EU regulations are published in all official European languages (see e.g. http://eur-lex.europa.eu/legal-content/EN-CS/TXT/?uri=CELEX:52011XC0506(05)&from=CS), providing an invaluable language resource (if nothing else).

OPUS (http://opus.lingfil.uu.se/) is a repository which contains most of the publicly available parallel corpora. The data is available for download in several formats, cleaned and processed with a unified pipeline.
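
Many of these corpora, including most OPUS downloads, can be obtained as a pair of line-aligned plain-text files (Moses format), where line i of one file is the translation of line i of the other. A minimal reading sketch, with hypothetical file names:

def read_parallel(src_path, tgt_path):
    # Yield (source, target) sentence pairs from two line-aligned files,
    # one sentence per line, line i of each file being mutual translations.
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

# Illustrative usage with hypothetical file names:
# for cs, en in read_parallel("corpus.cs-en.cs", "corpus.cs-en.en"):
#     print(cs, "|||", en)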

token vs. type
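
The note above refers to the distinction between tokens (running words, counted with repetition) and types (distinct words); more data adds tokens much faster than it adds new types. A tiny illustration, using a made-up example sentence:

sentence = "the cat sat on the mat"   # made-up example sentence
tokens = sentence.split()             # running words: 6 tokens
types = set(tokens)                   # distinct words: 5 types ("the" occurs twice)
print(len(tokens), len(types))        # prints: 6 5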

Obtaining More Data

Crowdsourcing

Social Media

Zipf's Law
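
Zipf's law is the rough observation that the frequency of the r-th most frequent word is proportional to 1/r, so rank times frequency stays approximately constant. A small sketch for checking this on any word list; the corpus file name is hypothetical:

from collections import Counter

def zipf_check(tokens, top=10):
    # Rank words by frequency; if Zipf's law holds, rank * frequency
    # stays roughly constant across the top ranks.
    counts = Counter(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(rank, word, freq, rank * freq)

# Illustrative usage with a hypothetical corpus file:
# zipf_check(open("corpus.txt", encoding="utf-8").read().split())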

References

  1. Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data.
  2. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians.
  3. Philipp Koehn. Inaugural lecture.