Sort all the words in the books stored in Project Gutenberg (http://www.gutenberg.org): 42144 books, 16 billion characters, 23929 authors and 2'880'579'249 words as of July 2010. The goal is to create a dictionary of the 100000 most frequently used English words in literature from the 15th to the 19th century (definitions from WordNet 3.0).

Download the dictionary (6000 pages). List of the 23929 authors.

The calculation took 96 hours and found a total of 2647659 unique words in the analyzed texts. The 100000 most frequent words account for 2854175206 of the words in the whole corpus (99.08%). To understand 50% of the words in the 42144 books, you need to know the first 93 words; for 70%, the first 696; for 90%, the first 6428; for 95%, the first 14736.

This gadget analyzes your vocabulary to create a frequency dictionary of the words you use the most. Shakespeare used 30264 different words in his complete works. To find the most used words in a text, paste it here, then run the analysis.
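The kind of analysis described above can be sketched in a few lines of Python. This is only an illustration, not the program used for the 96-hour calculation: the tokenizer (lowercase letters with an optional apostrophe), the function names word_frequencies and words_needed_for_coverage, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count occurrences of each lowercase word.

    The tokenizer is a simplifying assumption: a word is a run of
    ASCII letters, optionally followed by an apostrophe and more letters.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


def words_needed_for_coverage(freqs, fraction):
    """Return how many of the most frequent words are needed to
    cover the given fraction of all word occurrences."""
    total = sum(freqs.values())
    covered = 0
    for rank, (_, count) in enumerate(freqs.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return rank
    return len(freqs)


# Hypothetical sample text, standing in for a pasted document.
sample = "the cat and the dog and the bird saw the cat"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
print(words_needed_for_coverage(freqs, 0.5))
```

Applied to the full corpus, the second function would correspond to the figures quoted above (the number of top-ranked words needed to reach 50%, 70%, 90% or 95% of all occurrences), although the exact result depends on how words are split and normalized.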
The first 100000 words follow Zipf's law:
This chart shows the number of occurrences of each word versus its rank.
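Zipf's law says that the frequency of a word is roughly proportional to 1 / rank^s, with s close to 1, so a plot of log(frequency) against log(rank) is close to a straight line of slope -s. As a rough sketch of how this could be checked, the function below estimates s with an ordinary least-squares fit on the log-log values. The function name zipf_exponent and the synthetic counts are assumptions made for the example; they are not the Gutenberg data.

```python
import math


def zipf_exponent(counts):
    """Estimate the Zipf exponent s from a list of word counts.

    Fits a least-squares line to log(count) versus log(rank) and
    returns the negated slope. Requires at least two positive counts.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic counts that follow count = C / rank exactly (assumed data).
ideal_counts = [1_000_000 / rank for rank in range(1, 1001)]
print(round(zipf_exponent(ideal_counts), 3))
```

On real word counts the fitted value will deviate from 1, especially at the very top and in the long tail of rare words, so the fit is best read as a summary of the middle ranks.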
In French, the number of books in Project Gutenberg is limited: they contain 139'837'771 words in total, of which 538086 are unique. You need to know the first 89 words to understand 50% of the words; for 70%, the first 795; for 90%, the first 9050; for 95%, the first 21231. Download the 30000 most frequent words in French.