Translation systems have orders of magnitude more training data at their disposal than a person reads in a lifetime.
Available Sources of Large Data
This is by no means an exhaustive list of sources, only a few of the interesting ones.
Common Crawl is an initiative which builds an open repository of crawled web data. Moses n-grams are similar to Google n-grams but have been computed on the Common Crawl. There is no pruning, so the data are much larger but can offer more detailed statistics.
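To make the effect of pruning concrete, here is a minimal sketch of n-gram counting with and without a count threshold. The function names, the toy token sequence and the threshold of 2 are illustrative assumptions and do not reflect how either n-gram collection was actually built.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, min_count):
    """Keep only n-grams seen at least min_count times (hypothetical threshold)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Toy data, for illustration only.
tokens = "a b a b a c".split()
bigrams = ngram_counts(tokens, 2)
pruned = prune(bigrams, min_count=2)

print(len(bigrams), "distinct bigrams before pruning")
print(len(pruned), "distinct bigrams after pruning")
```

Pruning shrinks the table but discards exactly the rare n-grams, which is the detail that an unpruned collection retains.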
Politics has been the motivation for a lot of parallel corpora. Canadian Hansard is published both in French and English and is one of the best-known parallel corpora. EU regulations are published in all official European languages, providing an invaluable language resource (if nothing else).
OPUS is a repository which contains most of the publicly available parallel corpora. The data are available for download in several formats, cleaned and processed with a unified pipeline.
Obtaining More Data
More data, or more data of a specific kind, can be obtained, for example, via crowdsourcing.
People also create large amounts of data every day and a good part of this is published via social media. It is therefore not surprising that some research in NLP focuses on leveraging these interesting new data sources.
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
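In other words, the frequency at rank r is roughly f(1) / r, where f(1) is the frequency of the most frequent word. The following sketch builds a rank-frequency table and compares observed counts with that idealized prediction. The helper names and the toy sentence are assumptions made for illustration; a corpus this small will not follow the law closely.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by decreasing count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

def zipf_expected(top_count, rank):
    """Count predicted at a rank by an idealized Zipf's law (exponent 1)."""
    return top_count / rank

# Toy data, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the log".split()
table = rank_frequency(tokens)
top_count = table[0][2]

for rank, word, count in table[:4]:
    predicted = zipf_expected(top_count, rank)
    print(f"rank {rank}: {word!r} observed {count}, predicted {predicted:.2f}")
```

On a real corpus, plotting count against rank on log-log axes should give an approximately straight line.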
Implications of this law can be observed everywhere in NLP. While just the few dozen most frequent words (types) cover half of all tokens (word occurrences) in a natural language corpus, the tail of infrequent words is extremely long.
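Token coverage of this kind can be measured directly on any tokenized corpus. The sketch below computes the share of tokens covered by the k most frequent types, and the number of types needed to reach a target share. The function names and the toy sentence are assumptions for illustration; the figures for a real corpus will differ.

```python
from collections import Counter

def coverage_of_top_k(tokens, k):
    """Share of all tokens accounted for by the k most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

def types_needed(tokens, target):
    """Smallest number of most frequent types covering the target share of tokens."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for k, count in enumerate(counts, start=1):
        covered += count
        if covered >= target * total:
            return k
    return len(counts)

# Toy data, for illustration only (assumes a non-empty token list).
tokens = "the cat sat on the mat and the dog sat on the log".split()
print(f"top-1 coverage: {coverage_of_top_k(tokens, 1):.2f}")
print(f"types needed for half of the tokens: {types_needed(tokens, 0.5)}")
```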
Moreover, even if we collect many times more data than we have at the moment, we will not cover much more of the tail and many infrequent words will remain out-of-vocabulary (OOV) for our NLP system.
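A small simulation can illustrate how slowly the OOV rate falls. The sketch below draws word ids from an idealized Zipfian distribution and measures the OOV rate on a held-out sample for growing training prefixes. The vocabulary size, sample sizes, seed and the exponent of 1 are all assumptions chosen for illustration, and real text is not an independent sample from such a distribution.

```python
import random
from itertools import accumulate

def zipf_sampler(vocab_size, seed=0):
    """Return a function drawing word ids with probability proportional to 1/rank."""
    rng = random.Random(seed)
    cum_weights = list(accumulate(1.0 / r for r in range(1, vocab_size + 1)))
    population = range(vocab_size)  # word id 0 is the most frequent word

    def sample(n):
        return rng.choices(population, cum_weights=cum_weights, k=n)

    return sample

def oov_rate(train_tokens, test_tokens):
    """Share of test tokens whose type never occurs in the training tokens."""
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)

sample = zipf_sampler(vocab_size=50_000, seed=0)
stream = sample(100_000)   # simulated training data
test = sample(10_000)      # simulated held-out data

# Each training set is a prefix of the next, so more data never hurts.
rates = {n: oov_rate(stream[:n], test) for n in (1_000, 10_000, 100_000)}
for n, rate in rates.items():
    print(f"{n:>7} training tokens: OOV rate {rate:.3f}")
```

Each tenfold increase in training data reduces the OOV rate in this simulation, but a noticeable share of test tokens remains unseen even at the largest size.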