Named Entity Recognition (NER) involves identifying named entities such as persons, locations, and organizations in text. NER is essential for a variety of Natural Language Processing (NLP), Information Retrieval (IR), and Social Computing (SC) applications. In this blog post, I present QCRI’s state-of-the-art NER system for Arabic microblogs.
Microblog NER Challenges
NER on microblogs faces many challenges such as:
(1) Microblogs are often characterized by informal language, frequent spelling mistakes, and the presence of Twitter name mentions (e.g., @someone), hashtags, and URLs;
(2) NEs are often abbreviated. For example, tweeps (tweet authors) may write “Real Madrid” as just “the Real”;
(3) Tweeps often use brief and choppy expressions and incomplete sentences;
(4) Word senses in tweets may differ from word senses in news. For example, “mary jane” in tweets likely refers to marijuana as opposed to a person’s name;
(5) Tweeps may use capitalization inconsistently (for English): words that should be capitalized may not be, and ALL CAPS words are used for emphasis; and
(6) We observed that NEs often appear at the beginning or the end of tweets, often in abbreviated form.
Arabic microblogs exhibit additional complications, namely:
- Tweets may contain transliterated words (e.g., “LOL” → لول) and non-Arabic words, particularly hashtags (e.g., #Syria).
- Arabic lacks capitalization.
- Most named entities that are observed in tweets are unlikely to have been seen during the training of an NER system.
- Tweets frequently use dialects, which may lack spelling standards (e.g., معرفتش and ماعرفتش are varying spellings of “I did not know”), introduce a variety of new words (e.g., محد means “no one”), or make different lexical choices for concepts (e.g., كويس and باهي both mean “good”).
- Dialects introduce morphological variations with different prefixes and suffixes. For example, Egyptian and Levantine tend to insert the letter ب (sounds like “ba”) before verbs in the present tense.
Most work on NER relies on a sequence labeler, such as a Conditional Random Fields (CRF) labeler, that uses a variety of contextual features and gazetteers, which are large lists of named entities. Our state-of-the-art NER system builds on this approach by introducing novel ways of building larger gazetteers, applying domain adaptation, using semi-supervised training, performing transliteration mining, and employing cross-lingual English-Arabic resources such as Wikipedia. We train a CRF sequence labeler with these enhancements.
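To make the idea of contextual and gazetteer features concrete, here is a minimal Python sketch of the kind of per-token feature dictionary that is typically fed to a CRF labeler. It is an illustration only and not the actual feature set of our system: the function names, the feature names, and the single-token gazetteer lookup are all assumptions made for the example.

```python
def token_features(tokens, i, gazetteer):
    """Build a feature dict for tokens[i] from the word itself,
    its neighbors, and a gazetteer lookup (all names are illustrative)."""
    word = tokens[i]
    return {
        "word": word,
        "prefix2": word[:2],          # crude stand-in for morphological prefixes
        "suffix2": word[-2:],         # crude stand-in for morphological suffixes
        "in_gazetteer": word in gazetteer,
        "is_mention": word.startswith("@"),
        "is_hashtag": word.startswith("#"),
        # Context window of one token on each side, with boundary markers.
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
    }


def sentence_features(tokens, gazetteer):
    """Return one feature dict per token, in order."""
    return [token_features(tokens, i, gazetteer) for i in range(len(tokens))]


if __name__ == "__main__":
    tweet = ["زار", "أحمد", "الدوحة"]
    for feats in sentence_features(tweet, gazetteer={"الدوحة"}):
        print(feats)
```

A list of such dictionaries per tweet, paired with a list of labels, is the input format that common CRF toolkits expect. Real gazetteers contain multi-word entries, so a fuller version would match token spans rather than single words.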
Using Arabic Wikipedia
Since larger gazetteers can positively impact NER, we used Wikipedia to build them. To do so, we used keywords in Wikipedia category names to select the Wikipedia titles that are names of persons, locations, and organizations. Here are sample words (translated into English) that we used for filtering:
- For persons: births, deaths, and living people.
- For locations: countries, capitals, provinces, states, cities, airports, etc.
- For organizations: organizations, companies, foundations, institutes, unions, etc.
We also used page redirects (alternative page names) to expand the gazetteers. The resultant gazetteer had 70,908 locations, 26,391 organizations, and 81,880 persons.
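The filtering and redirect expansion described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the inputs are plain dictionaries (page title to category names, and page title to redirect titles) rather than a real Wikipedia dump, the keywords are the English translations listed above rather than the Arabic words we actually matched, and the function names are made up for the example.

```python
# Sample category keywords from the lists above (English translations).
CATEGORY_KEYWORDS = {
    "PER": ("births", "deaths", "living people"),
    "LOC": ("countries", "capitals", "provinces", "states", "cities", "airports"),
    "ORG": ("organizations", "companies", "foundations", "institutes", "unions"),
}


def entity_type(categories):
    """Return the first entity type whose keyword occurs in any
    category name, or None if no keyword matches."""
    for etype, keywords in CATEGORY_KEYWORDS.items():
        for category in categories:
            lowered = category.lower()
            if any(keyword in lowered for keyword in keywords):
                return etype
    return None


def build_gazetteers(page_categories, redirects):
    """Build one set of names per entity type, adding redirect
    titles (alternative page names) for every selected page."""
    gazetteers = {etype: set() for etype in CATEGORY_KEYWORDS}
    for title, categories in page_categories.items():
        etype = entity_type(categories)
        if etype is None:
            continue
        gazetteers[etype].add(title)
        gazetteers[etype].update(redirects.get(title, ()))
    return gazetteers


if __name__ == "__main__":
    pages = {
        "Naguib Mahfouz": ["1911 births", "Egyptian novelists"],
        "Doha": ["Capitals in Asia"],
        "Photosynthesis": ["Biology"],
    }
    print(build_gazetteers(pages, redirects={"Doha": ["Ad-Dawhah"]}))
```

Plain substring matching is crude (for instance, "states" also matches categories that merely mention the United States), so a practical filter would need more careful matching than this sketch shows.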
DBpedia is a large collaboratively built knowledge base in which structured information is extracted from Wikipedia, and it contains 6,157,591 Wikipedia titles belonging to 296 types. Types vary in granularity, with each Wikipedia title having one or more types. For example, NASA is assigned the following types: Agent, Organization, and Government Agency. In all, DBpedia includes the names of 764k persons, 573k locations, and 192k organizations. Of the Arabic Wikipedia titles, 254,145 have Wikipedia cross-lingual links to English Wikipedia, and of those English Wikipedia titles, 185,531 have entries in DBpedia. We used the DBpedia types as features for the NER system.
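As a rough illustration of how the cross-lingual links connect Arabic titles to DBpedia types, consider the sketch below. The two dictionaries stand in for the real cross-lingual link table and the DBpedia type index, and the function name and the feature naming scheme are assumptions for the example, not the exact encoding used in our system.

```python
def dbpedia_type_features(arabic_title, ar_to_en_links, dbpedia_types):
    """Map an Arabic Wikipedia title to binary DBpedia type features
    by following its cross-lingual link to the English title."""
    english_title = ar_to_en_links.get(arabic_title)
    if english_title is None:
        return {}  # no cross-lingual link, so no type features
    types = dbpedia_types.get(english_title, ())
    return {f"dbpedia_type={t}": True for t in types}


if __name__ == "__main__":
    links = {"ناسا": "NASA"}
    types = {"NASA": ["Agent", "Organization", "Government Agency"]}
    print(dbpedia_type_features("ناسا", links, types))
```

Titles without a cross-lingual link, or whose English counterpart has no DBpedia entry, simply contribute no type features.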
As I mentioned earlier, Arabic lacks capitalization, and Arabic names are often common Arabic words. For example, the Arabic name “Hasan” means “good.” To capture cross-lingual capitalization, we used a machine translation phrase table that was built from large amounts of parallel Arabic-English text and in which case was preserved on the English side. Given an Arabic word, we look up its English translations and observe the likelihood that they are capitalized.
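One simple way to turn this lookup into a number is to take the share of translation probability mass that falls on capitalized English words. The sketch below assumes the phrase table has been reduced to a dictionary from an Arabic word to a list of (English translation, probability) pairs; that layout, the function name, and the toy probabilities are illustrative assumptions, not the real phrase table format or values.

```python
def capitalization_likelihood(arabic_word, phrase_table):
    """Share of the translation probability mass of arabic_word that
    falls on English translations starting with an uppercase letter."""
    translations = phrase_table.get(arabic_word, [])
    total = sum(prob for _, prob in translations)
    if total == 0:
        return 0.0  # unseen word: no evidence of capitalization
    capitalized = sum(
        prob for english, prob in translations if english[:1].isupper()
    )
    return capitalized / total


if __name__ == "__main__":
    # Toy entries with made-up probabilities.
    table = {"حسن": [("Hasan", 0.5), ("good", 0.25), ("Hassan", 0.25)]}
    print(capitalization_likelihood("حسن", table))
```

Returning 0.0 for unseen words is a simplification; a separate "not in phrase table" feature may be a better choice in practice.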
Many named entities, particularly persons and locations, are often transliterated. We look up the translations of Arabic words in the aforementioned phrase table and then use an in-house transliteration miner to determine whether the Arabic word and its English translation are also transliterations of each other. If they are, we use the transliteration probability as a feature.
Using Domain Adaptation
In addition to training on microblog text tagged with named entities, we mixed tagged news text with the tagged microblog text to make use of the large amount of news training data.
For semi-supervised training, we used our best NER system to tag a large corpus of microblogs. Our intuition was that if we automatically tag a large set of tweets, then an NE may be tagged correctly multiple times. The automatically identified NEs can then be used as a “new gazetteer.”
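A minimal sketch of this idea is shown below. It assumes the tagger outputs BIO-style labels (B-PER, I-PER, O, and so on) and keeps only the entities that were tagged at least `min_count` times across the corpus. The label scheme, the function names, and the threshold of 2 are assumptions for illustration; they are not the settings of our system.

```python
from collections import Counter


def extract_entities(tagged_tweet):
    """Collect (entity_text, entity_type) spans from one tweet given
    as a list of (token, BIO label) pairs."""
    entities, current, current_type = [], [], None
    for token, label in tagged_tweet:
        if label.startswith("B-"):
            if current:  # close the previous span before starting a new one
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)
        else:  # "O" or an inconsistent "I-" label ends the span
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


def build_auto_gazetteer(tagged_tweets, min_count=2):
    """Keep entities that the tagger found at least min_count times."""
    counts = Counter()
    for tweet in tagged_tweets:
        counts.update(extract_entities(tweet))
    return {entity for entity, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    corpus = [
        [("الريال", "B-ORG"), ("فاز", "O")],
        [("مبروك", "O"), ("الريال", "B-ORG")],
        [("ريال", "B-ORG"), ("مدريد", "I-ORG")],
    ]
    print(build_auto_gazetteer(corpus))
```

Requiring repeated occurrences is one simple way to act on the intuition that correct tags tend to recur while one-off tagging errors do not.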
How Good is our NER System
The QCRI NER system is considered state-of-the-art for Arabic microblogs. Table 1 reports the evaluation results for the NER system. We performed the evaluation on a set of 1,423 tweets containing nearly 26k tokens. The tweets were randomly selected from the period of Nov. 23-27, 2011.
Table 1. NER results
Obtaining and Citing our NER System
The system is part of the Farasa Arabic processing toolkit and is available under a research license from: http://qatsdemo.cloudapp.net/farasa/ . It is written entirely in Java and can be invoked as a stand-alone executable or through an API. Usage examples are available at: http://qatsdemo.cloudapp.net/farasa/usage.html .
For a detailed description of the system, please refer to the following two papers:
Kareem Darwish. 2013. Named Entity Recognition using Cross-lingual Resources: Arabic as an Example. ACL-2013.
Kareem Darwish, Wei Gao. 2014. Simple Effective Microblog Named Entity Recognition: Arabic as an Example. LREC-2014.