April 27, 2012

New Resource: Leiden Weibo Corpus

From http://linguistlist.org/issues/23/23-2009.html

The Leiden Weibo Corpus (LWC) is a linguistically annotated corpus of 100 million words, drawn from 5.1 million messages on Sina Weibo, China’s premier Twitter-like microblogging service.

The LWC is freely available online at http://lwc.daanvanesch.nl . Data for the LWC was collected in January 2012. As such, it contains many linguistic phenomena that may not be found in older corpora, such as suffixation with "-ing", an aspect marker borrowed from English.

Furthermore, Sina Weibo messages come with valuable metadata, such as the user's gender and location. This information allows the LWC to calculate how often words are used in different provinces and cities across China, which is useful for research into lexical variation.
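
To make the idea of per-region word frequencies concrete, here is a minimal Python sketch. It is not the LWC's own code or interface: the sample messages, the function name relative_frequency_by_region, and the per-million normalisation are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample: (province, segmented tokens) pairs.
# This is invented illustration data, not taken from the LWC.
messages = [
    ("Guangdong", ["我", "在", "吃饭", "ing"]),
    ("Guangdong", ["今天", "很", "开心"]),
    ("Beijing", ["我", "在", "上班", "ing"]),
    ("Beijing", ["上班", "ing"]),
]

def relative_frequency_by_region(messages, word, per=1_000_000):
    """Return occurrences of `word` per `per` tokens, for each region."""
    hits = Counter()    # occurrences of `word` in each region
    totals = Counter()  # total token count in each region
    for region, tokens in messages:
        totals[region] += len(tokens)
        hits[region] += tokens.count(word)
    # Normalise by region size so large and small regions are comparable.
    return {region: hits[region] / totals[region] * per for region in totals}

freq = relative_frequency_by_region(messages, "ing")
for region, value in sorted(freq.items()):
    print(region, round(value))
```

Normalising by the number of tokens per region matters here: raw counts would mostly reflect where the most messages were posted, not where a word is relatively more common.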

Naturally, the LWC also supports searching for single words or grammar patterns, such as "any verb followed by an aspectual particle and then a noun". This feature may also be of interest to students and teachers of Mandarin who are looking for example sentences.
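
A pattern such as "verb, aspectual particle, noun" can be thought of as a sequence of part-of-speech tags matched against tagged text. The sketch below shows one simple way to do that in Python. The tag labels ("V", "ASP", "N", "PRON"), the sample sentence, and the function find_pattern are hypothetical; the LWC's actual tagset and query syntax may differ.

```python
def find_pattern(tagged, pattern):
    """Return every run of words whose tags match `pattern` in order."""
    n = len(pattern)
    matches = []
    # Slide a window of len(pattern) tokens across the sentence.
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag == want for (_, tag), want in zip(window, pattern)):
            matches.append([word for word, _ in window])
    return matches

# Hypothetical tagged sentence as (word, tag) pairs; the tags are assumptions.
sentence = [("我", "PRON"), ("吃", "V"), ("了", "ASP"), ("饭", "N")]

matches = find_pattern(sentence, ["V", "ASP", "N"])
for words in matches:
    print("".join(words))
```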

For a full description go to http://linguistlist.org/issues/23/23-2009.html
The corpus is available at http://lwc.daanvanesch.nl
