Summary: | "This project targets the description of a written Modern Standard Arabic corpus from the Agence France Presse (AFP) newswire archives for July-November 2000 (files dated 20000715 to 20001115). This corpus includes 734 stories representing 145,386 words (166,068 tokens after clitic segmentation in the Treebank; the number of Arabic tokens is 123,796). For this work, annotators must be native speakers of Arabic, and they must understand enough linguistics to check morphosyntactic analysis and build syntactic structures."
|