Subset of the Celva.Sp dataset containing 671 learner writings that:
    1) are written in English
    2) have a CEFR level assessed via the DIALANG test

The dataset is a table in which each row contains one student writing. The columns are as follows:
    • Nb annees L2: Number of years studying L2
    • L1: Native language
    • Domaine de specialite: Academic domain of the learner
    • Sejours duree semaines: Total number of weeks spent in English speaking countries 
    • Sejours frequence: Number of trips
    • Lang exposition: Out-of-class exposure to L2 English (movies, radio ...) 
    • Note dialang ecrit: CEFR class assigned by the DIALANG test
    • Lecture regularite: Reading frequency (daily, weekly, monthly)
    • autre langue: Other L2 being learnt
    • tache ecrit: Identifier of writing task (this subset contains only one task)
    • Texte etudiant: Texts written by students
    • Date ajout: Date of writing
    • pseudo: Pseudonymised ID of learner
The pair (Date ajout, pseudo) uniquely identifies a writing.
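
The table layout described above can be explored with a short script. The following is a minimal sketch in Python (standard library only), assuming that the CSV header uses the column names exactly as spelled in the list; the real header may differ, for example by using underscores instead of spaces. The helper names and the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Assumed column names, spelled as in the column list; the actual CSV
# header may differ (for example underscores instead of spaces).
KEY_COLUMNS = ("Date ajout", "pseudo")
LEVEL_COLUMN = "Note dialang ecrit"


def read_rows(handle):
    """Read the table into a list of dicts, one per learner writing."""
    return list(csv.DictReader(handle))


def duplicated_keys(rows, key_columns=KEY_COLUMNS):
    """Return the (Date ajout, pseudo) pairs that occur more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def level_counts(rows, level_column=LEVEL_COLUMN):
    """Count the writings in each CEFR class."""
    return Counter(row[level_column] for row in rows)


# Tiny invented sample standing in for the real file; values are made up.
SAMPLE = io.StringIO(
    "Date ajout,pseudo,Note dialang ecrit,Texte etudiant\n"
    "2020-01-01,aaa,B1,Some text\n"
    "2020-01-02,aaa,B2,Another text\n"
    "2020-01-01,bbb,B1,A third text\n"
)

rows = read_rows(SAMPLE)
print(duplicated_keys(rows))  # pairs violating the uniqueness claim, if any
print(level_counts(rows))     # number of writings per CEFR class
```

To run the same checks on the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `read_rows`; the encoding and the comma delimiter are assumptions to verify against the file.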

data_public_anglais_annotated_CEFR_dialang.csv

Subset of the Celva.Sp dataset, identical to data_public_anglais_annotated_CEFR_dialang.csv.

This table contains one more column, conllu_text. For each learner writing, it stores the tokenization, tagging, lemmatization and dependency parsing of the raw text, produced with UDPipe and stored as a string in CoNLL-U format.
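
A conllu_text value can be turned back into tokens with a few lines of code. The sketch below uses plain Python and relies only on the standard CoNLL-U layout: ten tab-separated fields per token line, comment lines starting with `#`, and a blank line between sentences. It assumes the column stores real tab and newline characters; the function name and the sample sentence are invented for illustration and are not taken from the dataset.

```python
# The ten fields of a CoNLL-U token line, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Split a CoNLL-U string into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, values))
        # Keep plain word lines only; skip multiword ranges ("1-2")
        # and empty nodes ("1.1").
        if token["id"].isdigit():
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Invented one-sentence example, not taken from the dataset.
SAMPLE = (
    "# text = I study biology.\n"
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tstudy\tstudy\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\tbiology\tbiology\tNOUN\tNN\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["deprel"])
```

For anything beyond a quick inspection, a dedicated CoNLL-U library is likely a better choice than this hand-written parser.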

data_public_anglais_annotated_CEFR_dialang_with_conllu.csv

Site under construction.
Please refer to the following article:
Mallart, C., Simpkin, A., Venant, R., Ballier, N., Stearns, B., Li, J. Y., & Gaillat, T. (2023). A new learner language data set for the study of English for Specific Purposes at university level. Proceedings of the 4th Conference on Language, Data and Knowledge - LDK 2023, 1, 281–287. https://hal.science/hal-04247635

Corpus d'Etude des Langues Vivantes Appliquées à une Spécialité

Keywords	learner corpus, L2, english for specific purposes, corpus linguistics
Author:	Thomas GAILLAT
title	Learner language data set for the study of English for Specific Purposes
http://nakala.fr/terms#created	2023-03-20
licence	CC-BY-4.0
type	http://purl.org/coar/resource_type/c_ddb1
keywords	learner corpus
keywords	L2
keywords	english for specific purposes
keywords	corpus linguistics
description (en)	This data set supports the study of English as a second language (L2) among learners studying specific academic domains. It includes 671 texts written by students from various academic domains at a French university. All learners responded to the same task prompt, designed to elicit language related to their specific domain, and had their CEFR level assessed with the DIALANG test. The data set includes structured textual data with rich Universal Dependencies linguistic annotation and metadata. It can be used for several types of NLP tasks to gain insight into the learning of English as an L2. The data was collected as part of the Analytics for Language Learning project (A4LL) – ANR-22-CE38-0015-01
languages	en
author	Thomas GAILLAT
