Data Collections Description

CSEN, CSZH, EcoEN, and EcoZH are the evaluation datasets described in the paper.
All data files are in standard JSON format.

1. .captions file
Video captions of the MOOC courses in the dataset; each line represents one video.
The text has been tokenized and annotated with part-of-speech (POS) tags.
For CSZH and EcoZH, we use Ansj (https://github.com/NLPchina/ansjseg) for word segmentation and POS tagging.
For CSEN and EcoEN, we use the POS tagger implemented by the Stanford NLP Group (http://nlp.stanford.edu/software/tagger.shtml).
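
As a rough illustration, a .captions file could be read one video at a time as sketched below. This assumes that each non-empty line holds one complete JSON value; the structure inside each value (how tokens and POS tags are stored) is not specified here, so the sketch returns the parsed value unchanged. The function name iter_videos is hypothetical.

```python
import json


def iter_videos(path):
    """Yield one parsed JSON value per non-empty line.

    Assumption: the file stores one video per line, each line being
    a complete JSON value. The inner structure is left untouched.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, list(iter_videos(path)) would give one entry per video, where path points to one of the .captions files.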

2. .candidates file
Candidate course concepts extracted from the dataset.
The "label" field is the human-annotated label for a candidate:
"1" marks a course concept and "0" marks any other candidate.