The 10-K Corpus (extended version)
The corpus includes:
- 10-K reports from publicly traded U.S. companies, from 1996 onward
- Stock return volatility measured over the twelve-month period before and after each report
- Abnormal trading volumes (see [Loughran and McDonald 2011] for the detailed definition)
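As a minimal sketch of one common volatility measure: the function below takes the annualized standard deviation of daily log returns over a window of closing prices. This is an illustrative assumption for how the twelve-month volatility could be computed, not necessarily the exact definition used to build the corpus (which follows Kogan et al.).

```python
import math
import statistics

def realized_volatility(prices, trading_days=252):
    """Annualized standard deviation of daily log returns.

    `prices` is a sequence of daily closing prices over the
    measurement window (e.g. the twelve months before a filing).
    Illustrative definition only, not the corpus's exact formula.
    """
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return statistics.stdev(log_returns) * math.sqrt(trading_days)
```

A flat price series yields zero volatility, and a choppier series yields a larger value, which is the sanity check to run on any such implementation.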
Pre-Trained Word Embedding Vectors
We release pre-trained word vectors produced with word2vec (CBOW model),
trained on the above 10-K corpus (40,708 reports spanning 18 years).
Both models (with and without incorporating syntactic information)
contain 200-dimensional vectors for each word.
- Pre-trained word vectors without syntactic information (359M tokens, 0.12M vocab, 200d vectors)
- Pre-trained word vectors with syntactic information (359M tokens, 0.26M vocab, 200d vectors)
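Assuming the vectors are distributed in word2vec's plain-text format (a header line with vocabulary size and dimensionality, then one word per line followed by its vector components), a dependency-free sketch of loading them and comparing words by cosine similarity looks like this. The tiny 4-dimensional inline example is illustrative only; the released vectors are 200-dimensional.

```python
import io
import math

def load_word2vec_text(stream):
    """Parse word2vec's plain-text vector format into a dict.

    First line: "<vocab_size> <dimensions>"; each following line:
    "<word> <v1> <v2> ...". Assumes the text (not binary) format.
    """
    vocab_size, dims = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.split()
        vec = [float(x) for x in parts[1:]]
        assert len(vec) == dims
        vectors[parts[0]] = vec
    assert len(vectors) == vocab_size
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny made-up file: 3 words, 4 dimensions (the real vectors are 200d).
sample = io.StringIO(
    "3 4\n"
    "risk 0.1 0.2 0.3 0.4\n"
    "volatility 0.1 0.2 0.3 0.5\n"
    "market 0.9 0.1 0.0 0.0\n"
)
vecs = load_word2vec_text(sample)
```

In practice, gensim's `KeyedVectors.load_word2vec_format` reads this same format directly, so the hand-rolled parser above is only needed when avoiding dependencies.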
Please cite the first paper (ACM TMIS) in any publication that makes use of the data above:
Ming-Feng Tsai, Chuan-Ju Wang, and Po-Chuan Chien. Discovering Finance Keywords via Continuous Space Language Models.
To appear in ACM Transactions on Management Information Systems (ACM TMIS).
Ming-Feng Tsai and Chuan-Ju Wang. Financial Keyword Expansion via Continuous Word Vector Representations.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP '14), Doha, 2014, pp. 1453-1458. [paper]
These data were collected primarily by Po-Chuan Chien, Chih-Chun Hsia, Yu-Wen Liu, Ming-Feng Tsai, and Chuan-Ju Wang.
Email Chuan-Ju Wang if you have questions about this project.
Parts of the data come from the 10-K corpus provided by Kogan et al. (2009).