The 10-K Corpus (extended version)
The 10-K reports from publicly traded U.S. companies from 1996 to 2013:
- Stock return volatility measurements in the twelve-month periods before and after each report
- Abnormal trading volumes (see [Loughran and McDonald 2011] for the detailed definition)
- Download (README):
Pre-Trained Word Embedding Vectors
We are publishing pre-trained word vectors, trained with word2vec (CBOW model) on the
10-K Corpus above (40,708 reports spanning 18 years). Two models are provided, one with and one without
syntactic information; each contains a 200-dimensional vector for every word in its vocabulary.
- Pre-trained word vectors without syntactic information (359M tokens, 0.12M vocab, 200d vectors)
- Pre-trained word vectors with syntactic information (359M tokens, 0.26M vocab, 200d vectors)
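As a rough illustration of how such vectors might be used, the sketch below reads vectors and ranks nearest neighbors by cosine similarity. It assumes the files follow the plain-text word2vec layout (a header line with vocabulary size and dimension, then one word per line followed by its values); this page does not state the distribution format, so check the README first. The function names and the three-dimensional toy data are hypothetical and are not taken from the released files.

```python
import io
import math


def load_text_vectors(handle):
    """Parse vectors from an open text handle.

    Assumed layout (word2vec text format): a header line
    "<vocab_size> <dim>", then one line per word consisting of the
    token followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=5):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 3-dimensional toy data standing in for a real 200-dimensional file.
toy = io.StringIO(
    "3 3\n"
    "loss 1.0 0.0 0.0\n"
    "deficit 0.9 0.1 0.0\n"
    "profit -1.0 0.0 0.0\n"
)
vectors, dim = load_text_vectors(toy)
neighbors = most_similar(vectors, "loss", topn=2)
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8")`, where `path` points to the downloaded vectors. A brute-force scan like this is adequate for a vocabulary of a few hundred thousand words when only occasional queries are needed; for heavier use, a vectorized library would be a better fit.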
Further Reading
If you write a paper that uses the data above, please cite the first paper below (ACM TMIS):
- Ming-Feng Tsai, Chuan-Ju Wang, and Po-Chuan Chien. Discovering Finance Keywords via Continuous Space Language Models. To appear in ACM Transactions on Management Information Systems (ACM TMIS).
- Ming-Feng Tsai and Chuan-Ju Wang. Financial Keyword Expansion via Continuous Word Vector Representations. Conference on Empirical Methods in Natural Language Processing (EMNLP '14), Doha, 2014, pp. 1453-1458. [paper]
Remarks
- These data were collected primarily by Po-Chuan Chien, Chih-Chun Hsia, Yu-Wen Liu, Ming-Feng Tsai, and Chuan-Ju Wang. Email Chuan-Ju Wang if you have questions about this project.
- Parts of the data are from the 10-K Corpus provided by Kogan et al. in 2009.