The 10-K Corpus (extended version)

The 10-K reports of publicly traded U.S. companies from 1996 to 2013:

Pre-Trained Word Embedding Vectors

We are publishing word vectors pre-trained with word2vec (CBOW model) on the above 10-K Corpus (40,708 reports spanning 18 years). Two models are provided, one with and one without syntactic information; each contains a 200-dimensional vector for every word in its vocabulary.
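As a rough illustration of how such vectors might be consumed, the sketch below parses the plain-text word2vec layout (a header line with vocabulary size and dimension, then one word and its values per line) and compares two words by cosine similarity. It assumes the files are distributed in that text layout, which this page does not state. The function names `load_text_vectors` and `cosine` are hypothetical, and the three-dimensional sample is toy data standing in for the real 200-dimensional vectors.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format from an open text handle.

    Assumed layout (not confirmed for the released files):
    first line "vocab_size dim", then "word v1 ... v_dim" per line.
    """
    _, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a real vector file; the values are invented for illustration.
sample = io.StringIO(
    "3 3\n"
    "profit 1.0 0.0 0.0\n"
    "loss -1.0 0.0 0.0\n"
    "revenue 0.8 0.6 0.0\n"
)
dim, vectors = load_text_vectors(sample)
similarity = cosine(vectors["profit"], vectors["revenue"])
```

For a real file, the same function could be given `open(path, encoding="utf-8")` in place of the `io.StringIO` sample. If the vectors turn out to be in word2vec's binary layout instead, a library reader such as gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` would be the more practical choice.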

Further Reading

If you use the data above in a paper, please cite the first paper (ACM TMIS):