FIN10K

The 10-K Corpus (extended version)

The corpus contains 10-K reports from publicly traded U.S. companies from 1996 to 2013, together with:

- Stock return volatility measurements for the twelve-month periods before and after each report
- Abnormal trading volumes (see [Loughran and McDonald 2011] for the detailed definition)
Download (README):
- original 10-K reports [all.full.tgz] (10GB)
- MD&A sections [all.mda.tgz] (760MB)
- tokenized MD&A sections [all.tok.tgz] (533MB)
- log postevent return volatility [all.logfama.tgz] (429KB)
- log stock volatility [all.logvol.tgz] (800KB)
- abnormal trading volumes [all.abnormal.tgz] (429KB)
- excess returns [all.excess.tgz] (405KB)
- meta files [all.meta.tgz] (1.2MB)
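
The files above are gzip-compressed tar archives. The following is a minimal sketch, using only the Python standard library, of how one might peek into an archive and unpack it. The function names `list_members` and `extract_archive` are hypothetical helpers introduced for illustration, and the internal layout of each archive is not described here, so inspect the member names (and the README) before relying on any particular directory structure.

```python
import tarfile
from pathlib import Path


def list_members(archive_path, limit=10):
    """Return the names of the first `limit` members of a .tgz archive."""
    names = []
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:  # iterates lazily, so large archives are fine
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


def extract_archive(archive_path, dest_dir):
    """Unpack a .tgz archive into dest_dir and return that directory as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        tar.extractall(dest)
    return dest
```

For example, `list_members("all.meta.tgz")` would show the first few entries of the meta archive once it has been downloaded, and `extract_archive("all.meta.tgz", "fin10k/meta")` would unpack it into a directory of your choosing (the target path is only an example).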

Pre-Trained Word Embedding Vectors

We are publishing word vectors pre-trained with word2vec (CBOW model) on the above 10-K Corpus (40,708 reports from 18 years). Two models are provided, one with and one without syntactic information; each contains a 200-dimensional vector for every word in its vocabulary.

Pre-trained word vectors w/o incorporating syntactic information (359M tokens, 0.12M vocab, 200d vectors)
- word vectors [sim.expand.200d.vec.tgz] (47MB)
- bin file [sim.expand.200d.bin.tgz] (51MB)
Pre-trained word vectors w/ incorporating syntactic information (359M tokens, 0.26M vocab, 200d vectors)
- word vectors [syn.expand.200d.vec.tgz] (83MB)
- bin file [syn.expand.200d.bin.tgz] (90MB)
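
As a rough illustration of how the vectors might be used, the sketch below parses vectors in the standard word2vec text format and ranks words by cosine similarity. It assumes the `.vec` files follow that format (a header line with vocabulary size and dimension, then one word per line followed by its values), which should be checked against the downloaded files. The helper names and the tiny 2-dimensional sample are made up for illustration and are not taken from the released vectors.

```python
import io
import math


def load_vectors(lines, max_words=None):
    """Parse word2vec text format: header 'vocab_size dim', then 'word v1 ... vd'."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if max_words is not None and len(vectors) >= max_words:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(vectors, word, topn=5):
    """Return the topn (word, similarity) pairs closest to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy data standing in for an extracted .vec file (values are invented).
sample = io.StringIO("3 2\nloss 1.0 0.0\ndeficit 0.9 0.1\nprofit -1.0 0.0\n")
vecs = load_vectors(sample)
print(most_similar(vecs, "loss", topn=2))
```

With the real data, one would pass an open file handle for the extracted `.vec` file instead of `sample`, and could set `max_words` to limit memory use while experimenting. The brute-force scan is linear in vocabulary size, which is adequate for occasional queries but slow for bulk use.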

Further Reading

Please cite the first paper (ACM TMIS) if you publish work that uses the data above:

Ming-Feng Tsai, Chuan-Ju Wang, and Po-Chuan Chien. Discovering Finance Keywords via Continuous Space Language Models. To appear in ACM Transactions on Management Information Systems (ACM TMIS).
Ming-Feng Tsai and Chuan-Ju Wang. Financial Keyword Expansion via Continuous Word Vector Representations. Conference on Empirical Methods in Natural Language Processing (EMNLP '14), Doha, 2014, pp. 1453-1458. [paper]

Remarks

These data were collected primarily by Po-Chuan Chien, Chih-Chun Hsia, Yu-Wen Liu, Ming-Feng Tsai, and Chuan-Ju Wang.
Email Chuan-Ju Wang if you have questions about this project.
Parts of the data are from the 10-K Corpus provided by Kogan et al. in 2009.