Data Acquisition: Difference between revisions

From MT Talks
mNo edit summary
(updated link)
 
(5 intermediate revisions by one other user not shown)
Line 1: Line 1:
{{Infobox
|title = Lecture 6: Data Acquisition
|image = [[File:bigbrother.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] <br/> [https://www.youtube.com/watch?v=7obaii5xldQ&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V&index=6 Youtube]
}}
{{#ev:youtube|https://www.youtube.com/watch?v=7obaii5xldQ|800|center}}
There seems to be a universal rule for statistical methods in NLP (and not only for them): '''more data is better data.'''<ref name=effectiveness>A. Halevy, P. Norvig, F. Pereira. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4804817&tag=1 ''The Unreasonable Effectiveness of Data'']</ref><ref name=friends>Jan Hajič, Eva Hajičová. [http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1 ''Some of Our Best Friends Are Statisticians'']</ref>


Line 4: Line 13:


== Available Sources of Large Data ==
This list is by no means exhaustive; it covers just a few of the more interesting sources.


=== Monolingual ===


Google n-grams, Google book n-grams, Common Crawl + Moses n-grams
Google released [http://googleresearch.blogspot.cz/2006/08/all-our-n-gram-are-belong-to-you.html n-grams of the whole web] and of [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Google Books].
 
[http://commoncrawl.org/ Common Crawl] is an initiative that builds an open repository of crawled web data. [http://www.statmt.org/ngrams/ Moses n-grams] are similar to Google n-grams but have been computed on the Common Crawl. No pruning is applied, so the data are much larger but can offer more detailed statistics.


=== Parallel ===


OPUS
Politics has been the motivation for a lot of parallel corpora. The Canadian Hansard is published in both French and English and is one of the best-known parallel corpora. [https://ec.europa.eu/jrc/en/language-technologies EU regulations] are published in all official EU languages, providing an invaluable language resource (if nothing else).


EU
[http://opus.lingfil.uu.se/ OPUS] is a repository which contains most of the publicly available parallel corpora. The data is available for download in several formats, cleaned and processed with a unified pipeline.
 
token vs. type


== Obtaining More Data ==


=== Crowdsourcing ===
More data, or more data of a specific kind, can be obtained e.g. via '''crowdsourcing'''.


=== Social Media ===
People also create large amounts of data every day and a good part of this is published via '''social media'''. It is therefore not surprising that some research in NLP focuses on leveraging these interesting new data sources.


== Zipf's Law ==


[http://en.wikipedia.org/wiki/Zipf%27s_law Zipf's law] states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Implications of this law can be observed everywhere in NLP. While just a few dozen of the most frequent words (''types'') cover roughly half of all ''tokens'' (word occurrences) in a natural language corpus, the tail of infrequent words is extremely long.
Moreover, even if we collect many times more data than we have at the moment, we will not cover much more of the tail, and many infrequent words will remain ''out-of-vocabulary'' (OOV) for our NLP system.


== References ==


<references />

Latest revision as of 22:00, 24 February 2015

Lecture 6: Data Acquisition
Lecture video: web TODO
Youtube

{{#ev:youtube|https://www.youtube.com/watch?v=7obaii5xldQ|800|center}}

There seems to be a universal rule for statistical methods in NLP (and not only for them): more data is better data.[1][2]

Translation systems have at their disposal orders of magnitude more training data than a person reads in a lifetime.[3]

Available Sources of Large Data

This list is by no means exhaustive; it covers just a few of the more interesting sources.

Monolingual

Google released n-grams of the whole web and of Google Books.

Common Crawl is an initiative that builds an open repository of crawled web data. Moses n-grams are similar to Google n-grams but have been computed on the Common Crawl. No pruning is applied, so the data are much larger but can offer more detailed statistics.
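
To make the trade-off around pruning concrete, here is a minimal sketch of n-gram counting with an optional frequency cutoff. The function names `count_ngrams` and `prune` and the threshold value are hypothetical and chosen only for illustration; this is not the pipeline used to build either the Google or the Moses n-grams.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prune(counts, min_count):
    """Keep only n-grams seen at least min_count times (illustrative cutoff)."""
    return Counter({ngram: c for ngram, c in counts.items() if c >= min_count})


if __name__ == "__main__":
    tokens = "more data is better data and more data is more data".split()
    bigrams = count_ngrams(tokens, 2)
    pruned = prune(bigrams, min_count=2)
    # The pruned table is smaller, but all information about rare bigrams is lost.
    print(len(bigrams), "bigram types before pruning,", len(pruned), "after")
```

Without the `prune` step the table keeps every n-gram, including those seen only once, which is why unpruned collections are much larger yet give more detailed statistics about rare events.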

Parallel

Politics has been the motivation for a lot of parallel corpora. The Canadian Hansard is published in both French and English and is one of the best-known parallel corpora. EU regulations are published in all official EU languages, providing an invaluable language resource (if nothing else).

OPUS is a repository which contains most of the publicly available parallel corpora. The data is available for download in several formats, cleaned and processed with a unified pipeline.

Obtaining More Data

More data, or more data of a specific kind, can be obtained e.g. via crowdsourcing.

People also create large amounts of data every day and a good part of this is published via social media. It is therefore not surprising that some research in NLP focuses on leveraging these interesting new data sources.

Zipf's Law

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
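
Written as a formula, the idealized law looks as follows. The symbols are assumptions introduced here for illustration: f(r) is the frequency of the word of rank r, C is a corpus-dependent constant, V is the number of distinct words, and H_n is the n-th harmonic number. Real corpora follow this only approximately.

```latex
\[
  f(r) \approx \frac{C}{r}
  \quad\Longrightarrow\quad
  \frac{f(1)}{f(2)} \approx 2, \qquad \frac{f(1)}{f(3)} \approx 3 .
\]
% Share of all word occurrences covered by the k most frequent words,
% under the idealized law with V distinct words:
\[
  \mathrm{coverage}(k)
  = \frac{\sum_{r=1}^{k} C/r}{\sum_{r=1}^{V} C/r}
  = \frac{H_k}{H_V},
  \qquad H_n = \sum_{r=1}^{n} \frac{1}{r} \approx \ln n + 0.577 .
\]
```

Because H_n grows only logarithmically, the first few ranks contribute a large share of the total, while each additional rank in the tail adds very little.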

Implications of this law can be observed everywhere in NLP. While just a few dozen of the most frequent words (types) cover roughly half of all tokens (word occurrences) in a natural language corpus, the tail of infrequent words is extremely long.

Moreover, even if we collect many times more data than we have at the moment, we will not cover much more of the tail, and many infrequent words will remain out-of-vocabulary (OOV) for our NLP system.
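
The notions of types, tokens, coverage and OOV rate can be checked on any tokenized corpus with a few lines of code. The following is a minimal sketch, not a reference implementation: all function names are hypothetical, and `zipf_corpus` builds an artificial corpus that follows the idealized law exactly, which real text does not.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (type, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def coverage(tokens, k):
    """Fraction of all tokens covered by the k most frequent types."""
    table = rank_frequency(tokens)
    covered = sum(count for _, count in table[:k])
    return covered / len(tokens)


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose type never occurs in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


def zipf_corpus(num_types, scale):
    """Artificial corpus: the word of rank r occurs round(scale / r) times, at least once."""
    tokens = []
    for r in range(1, num_types + 1):
        tokens.extend([f"w{r}"] * max(1, round(scale / r)))
    return tokens


if __name__ == "__main__":
    tokens = zipf_corpus(num_types=10000, scale=10000)
    print("tokens:", len(tokens), "types:", len(set(tokens)))
    for k in (10, 100, 1000):
        print(f"top {k} types cover {coverage(tokens, k):.1%} of tokens")
```

To observe the OOV effect on real data, one could split a tokenized corpus into a training and a held-out part and call `oov_rate(train, held_out)` for growing training sizes; both functions assume non-empty token lists.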

References

  1. A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data
  2. Jan Hajič, Eva Hajičová. Some of Our Best Friends Are Statisticians
  3. Philipp Koehn. Inaugural lecture.