MicroMappers Philippines Damage Assessment Deployment: UAV Images

Update: Our volunteers' activity map

[Map: MicroMappers volunteer activity]

Dear MicroMappers,

We need your help!

We are now launching the MicroMappers damage assessment expedition to the Philippines! The purpose of this deployment is to develop and deliver a cutting-edge machine learning algorithm that automatically detects and categorizes damaged infrastructure in UAV images taken in the aftermath of a natural disaster. Your help is essential to training this new deep learning algorithm, which will be released to the public as a final product for humanitarian purposes.

To start your digital humanitarian efforts, simply click on the link below:

Tutorial: https://sites.google.com/site/annotationtutorial/
Link: http://clickers.micromappers.org/project/kdd/

Thanks for your help in supporting these important disaster relief efforts!

Your MicroMappers Team,

MicroMappers Hub

Dear MicroMappers,

Earlier this year, we introduced MicroMappers' new face, the MicroMappers Hub. Since its release, we have been steadily rolling out new features. The current key features of MicroMappers are:

  • Twitter streaming and historical search data crawling. Unlike other platforms, you can collect both historical and live incoming data based on your keywords.
  • Facebook search over public groups and public pages.
  • GDELT world news downloads associated with any crisis. You can download 3W and crisis-image-related data, refreshed every 15 minutes, so the dataset is close to real time.
  • A tracker for currently trending keywords.
  • Sentiment analysis on your Twitter data collection (machine learning).
  • Disambiguation analysis on your Twitter data collection (machine learning).
  • GDELT GEO 2.0 API integration. Based on your keywords, the system pulls geolocated data from GDELT GEO 2.0, an API built on machine learning for NLP, computer vision, and machine translation, backed by big-data search (see the sketch after this list).
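For illustration, here is a minimal Python sketch of what a keyword query against the public GDELT GEO 2.0 API looks like. This is not MicroMappers' internal integration code; the endpoint and parameters follow GDELT's public documentation, and the "typhoon" keyword is just an example.

```python
import requests

# Public GDELT GEO 2.0 API endpoint (per GDELT's documentation).
GDELT_GEO_URL = "https://api.gdeltproject.org/api/v2/geo/geo"

def fetch_geo_mentions(keyword: str) -> list:
    """Return (lat, lon, name) tuples for locations mentioned alongside `keyword`."""
    resp = requests.get(GDELT_GEO_URL, params={"query": keyword, "format": "GeoJSON"})
    resp.raise_for_status()
    points = []
    for feat in resp.json().get("features", []):
        lon, lat = feat["geometry"]["coordinates"]  # GeoJSON order is (lon, lat)
        points.append((lat, lon, feat["properties"].get("name", "")))
    return points

if __name__ == "__main__":
    for lat, lon, name in fetch_geo_mentions("typhoon")[:10]:
        print(f"{name}: ({lat:.3f}, {lon:.3f})")
```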

Below is a walkthrough of the GDELT Image Classifier features in MicroMappers.

Once you log in, find the “Disaster Global Events” box, then click “Image Classifiers”.

[Screenshot: “Disaster Global Events” box]

You will now see the list of current image classifiers. If you want to define your own, click “Request New Image Classifier”, which redirects you to the configuration page. If you click “View Map”, you can see the map with images; see below.

[Screenshot: current image classifier list]

[Screenshot: map view with images]

This is the Image Classifier configuration page; you simply need to fill out the form.

[Screenshot: Image Classifier configuration form]

As you can see, there are some amazing features here. If you are not sure where to start, please check the “Tutorial” first.

We want to hear about your experiences and needs. Please visit the MicroMappers Hub and give us feedback.

Thank you,

MicroMappers Team,

 

QCRI Named Entity Recognition in Tweets

Named Entity Recognition (NER) involves identifying named entities such as persons, locations, and organizations in text. NER is essential for a variety of Natural Language Processing (NLP), Information Retrieval (IR), and Social Computing (SC) applications. In this post, I present QCRI's state-of-the-art NER system for Arabic microblogs.

Microblog NER Challenges

NER on microblogs faces many challenges such as:

(1) Microblogs are often characterized by informal language, ubiquitous spelling mistakes, and the presence of Twitter name mentions (e.g., @someone), hashtags, and URLs;

(2) NEs are often abbreviated. For example, tweeps (tweet authors) may write “Real Madrid” as just “the Real”;

(3) Tweeps often use brief and choppy expressions and incomplete sentences;

(4) Word senses in tweets may differ from word senses in news. For example, “mary jane” in tweets likely refers to marijuana rather than a person's name;

(5) Tweeps may use capitalization inconsistently (in English), where words that should be capitalized are not, and ALL-CAPS words are used for emphasis; and

(6) We observed that NEs often appear at the beginning or end of tweets, and they are often abbreviated.

Arabic microblogs exhibit additional complications, namely:

  • Tweets may contain transliterated words (e.g., “LOL” → لول) and non-Arabic words, particularly hashtags (e.g., #Syria)
  • Arabic lacks capitalization
  • Most named entities observed in tweets are unlikely to have been seen when the NER system was trained.
  • Tweets frequently use dialects, which may lack spelling standards (e.g., معرفتش and ماعرفتش are varying spellings of “I did not know”), introduce a variety of new words (e.g., محد means “no one”), or make different lexical choices for concepts (e.g., كويس and باهي both mean “good”).

Dialects introduce morphological variations with different prefixes and suffixes. For example, Egyptian and Levantine tend to insert the letter ب (sounds like “ba”) before verbs in present tense.

QCRI NER

Most work on NER uses a sequence labeler, such as a Conditional Random Fields (CRF) labeler, that draws on a variety of contextual features and gazetteers, which are large lists of named entities. Our state-of-the-art NER system builds along the same path, adding novel ways of building larger gazetteers, applying domain adaptation, using semi-supervised training, performing transliteration mining, and employing cross-lingual English-Arabic resources such as Wikipedia. We train a CRF sequence labeler with these enhancements.
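To make the recipe concrete, here is a minimal sketch of a CRF sequence labeler with contextual and gazetteer features, using the open-source sklearn-crfsuite package. This is not QCRI's implementation: the toy gazetteer, feature set, and two-sentence corpus are stand-ins for the much richer resources described below.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Toy gazetteer standing in for the large Wikipedia-derived lists described below.
LOC_GAZETTEER = {"doha", "qatar", "cairo"}

def word_features(sent, i):
    """Contextual + gazetteer features for token i, in the spirit of a CRF NER tagger."""
    w = sent[i]
    feats = {
        "word": w.lower(),
        "prefix2": w[:2],
        "suffix2": w[-2:],
        "in_loc_gazetteer": w.lower() in LOC_GAZETTEER,
        "is_first": i == 0,
        "is_last": i == len(sent) - 1,
    }
    if i > 0:
        feats["prev_word"] = sent[i - 1].lower()
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1].lower()
    return feats

# Tiny example corpus with BIO labels (real training data is far larger).
train_sents = [["He", "flew", "to", "Doha"], ["Qatar", "hosts", "QCRI"]]
train_labels = [["O", "O", "O", "B-LOC"], ["B-LOC", "O", "B-ORG"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, train_labels)
print(crf.predict(X))
```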

Using Arabic Wikipedia

Since building larger gazetteers can positively impact NER, we used Wikipedia to build large gazetteers. To do so, we used category names to select the Wikipedia titles that likely constitute names of persons, locations, or organizations. Here are sample words (translated into English) that we used for filtering:

  • For persons: births, deaths, and living people.
  • For locations: countries, capitals, provinces, states, cities, airports, etc.
  • For organizations: organizations, companies, foundations, institutes, unions, etc.

We also used page redirects (alternative page names) to expand the gazetteers.  The resultant gazetteer had 70,908 locations, 26,391 organizations, and 81,880 persons.
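The sketch below illustrates the category-filtering and redirect-expansion idea on toy data. A real run would iterate over a full Wikipedia dump, and the cue lists here are abbreviated versions of the filter words above.

```python
# Minimal sketch of the category-filtering idea on toy data.
PERSON_CUES = {"births", "deaths", "living people"}
LOCATION_CUES = {"countries", "capitals", "cities", "airports"}

pages = [
    ("Doha", {"capitals in asia", "populated places in qatar"}),
    ("Kareem Darwish", {"living people"}),
]
redirects = {"Ad-Dawhah": "Doha"}  # alternative page names -> canonical title

def matches(categories, cues):
    """True if any category name contains one of the cue words."""
    return any(cue in cat for cat in categories for cue in cues)

persons = {title for title, cats in pages if matches(cats, PERSON_CUES)}
locations = {title for title, cats in pages if matches(cats, LOCATION_CUES)}

# Expand with redirects: an alternative page name inherits its target's type.
for alias, target in redirects.items():
    if target in locations:
        locations.add(alias)

print(sorted(persons), sorted(locations))
# ['Kareem Darwish'] ['Ad-Dawhah', 'Doha']
```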

English DBpedia:

DBpedia is a large, collaboratively built knowledge base containing structured information extracted from Wikipedia; it includes 6,157,591 Wikipedia titles belonging to 296 types. Types vary in granularity, with each Wikipedia title having one or more types. For example, NASA is assigned the types Agent, Organization, and Government Agency. In all, DBpedia includes the names of 764k persons, 573k locations, and 192k organizations. Of the Arabic Wikipedia titles, 254,145 have cross-lingual links to English Wikipedia, and of those English Wikipedia titles, 185,531 have entries in DBpedia. We used the DBpedia types as features for the NER system.
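As an illustration, the types for a given title can be fetched from the public DBpedia SPARQL endpoint as sketched below. The live-endpoint lookup and the `dbpedia_types` helper are assumptions for illustration; a production system would more likely work from a DBpedia dump.

```python
import requests

# Public DBpedia SPARQL endpoint.
SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

def dbpedia_types(resource: str) -> set:
    """Fetch DBpedia ontology types for a Wikipedia title, usable as NER features."""
    query = f"""
    SELECT ?type WHERE {{
        <http://dbpedia.org/resource/{resource}> a ?type .
        FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
    }}"""
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": query, "format": "application/sparql-results+json"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return {r["type"]["value"].rsplit("/", 1)[-1] for r in rows}

print(dbpedia_types("NASA"))  # e.g. types such as Agent, Organisation, GovernmentAgency
```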

Cross-Lingual Capitalization:

As I mentioned earlier, Arabic lacks capitalization, and Arabic names are often also common Arabic words. For example, the Arabic name “Hasan” means “good.” To capture cross-lingual capitalization, we used a machine translation phrase table built from large amounts of parallel Arabic-English text in which case was not folded on the English side. Then, given an Arabic word, we look up its English translations and observe the likelihood that they are capitalized.
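Here is a minimal sketch of how such a feature can be computed, assuming a toy phrase table keyed by Arabic word with (English translation, probability) entries; the actual table described above is far larger.

```python
# Toy phrase table standing in for the large truecased Arabic-English table.
phrase_table = {
    "حسن": [("Hasan", 0.6), ("good", 0.4)],  # an Arabic name that is also a common word
}

def capitalization_likelihood(arabic_word: str) -> float:
    """Probability mass of translations whose English side is capitalized."""
    entries = phrase_table.get(arabic_word, [])
    total = sum(p for _, p in entries)
    if total == 0:
        return 0.0
    capitalized = sum(p for en, p in entries if en[:1].isupper())
    return capitalized / total

print(capitalization_likelihood("حسن"))  # 0.6 -> useful evidence that it is a name
```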

Cross-Lingual Transliteration:

Many named entities, particularly persons and locations, are often transliterated. We looked up the translations of Arabic words in the aforementioned phrase table and then used an in-house transliteration miner to determine whether the English and Arabic sides are also transliterations of each other. If they were, we used the transliteration probability as a feature.
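Since the transliteration miner is an in-house component, the sketch below substitutes a deliberately crude stand-in (a toy romanization map plus string similarity) just to show how a transliteration score could become a CRF feature; none of this is QCRI's actual miner.

```python
from difflib import SequenceMatcher

# Toy romanization map, not a full transliteration scheme.
ROMANIZE = {"ق": "q", "ط": "t", "ر": "r"}

def transliteration_probability(arabic: str, english: str) -> float:
    """Crude stand-in score: romanize the Arabic side, compare to the English side."""
    roman = "".join(ROMANIZE.get(ch, "") for ch in arabic)
    return SequenceMatcher(None, roman, english.lower()).ratio()

# A high score suggests (Arabic, English) is a transliteration pair -> CRF feature.
print(transliteration_probability("قطر", "Qatar"))  # 0.75 with this toy mapping
```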

Using Domain Adaptation:

Rather than training on tagged microblog text alone, we mixed in tagged news text to take advantage of the much larger news training data.
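In code, the mixing step can be as simple as concatenating the two tagged corpora before training; the sketch below uses toy sentences and trivial features for brevity, not our actual training setup.

```python
import sklearn_crfsuite

def feats(sent):
    """Trivial per-token features; a real system uses the rich features above."""
    return [{"word": w.lower()} for w in sent]

# Toy stand-ins: a "news" corpus and a much smaller "tweets" corpus, both tagged.
news = [(["Madrid", "won"], ["B-LOC", "O"])]
tweets = [(["the", "Real", "rocks"], ["O", "B-ORG", "O"])]

# Domain adaptation by simple mixing: train one CRF on the concatenation.
X = [feats(s) for s, _ in news + tweets]
y = [labels for _, labels in news + tweets]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
```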

Semi-Supervised Training:

We used our best NER system to tag a large corpus of microblogs. The intuition is that if we automatically tag a large set of tweets, an NE is likely to be tagged correctly multiple times; the automatically identified NEs can then be used as a “new gazetteer.”
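Here is a sketch of that mining step, reusing the toy `feats` helper and trained `crf` from the previous sketch: tag unlabeled tweets, count how often each (word, type) pair is predicted, and keep frequent pairs as gazetteer entries. The `min_count` threshold is an illustrative choice, not the paper's.

```python
from collections import Counter

def mine_gazetteer(sentences, crf, min_count=3):
    """Keep (word, NE type) pairs that the model predicts consistently often."""
    counts = Counter()
    predictions = crf.predict([feats(s) for s in sentences])
    for sent, tags in zip(sentences, predictions):
        for word, tag in zip(sent, tags):
            if tag != "O":
                counts[(word.lower(), tag.split("-")[-1])] += 1
    return {entry for entry, c in counts.items() if c >= min_count}

new_gazetteer = mine_gazetteer([["Madrid", "leads"]] * 5, crf)
print(new_gazetteer)  # e.g. {('madrid', 'LOC')}
```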

How Good Is Our NER System?

The QCRI NER system is considered state-of-the-art for Arabic microblogs. Table 1 reports the evaluation results for the NER system. We performed the evaluation on a set of 1,423 tweets containing nearly 26k tokens, randomly selected from the period of Nov. 23-27, 2011.

Table 1. NER results

Type     Precision  Recall  F-measure
LOC      83.6       70.8    76.7
ORG      76.4       43.7    55.6
PERS     67.1       47.8    55.8
Overall  76.8       56.6    65.2
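For reference, the F-measure column is the harmonic mean of precision and recall; for example, for the LOC row:

```latex
F_1 = \frac{2PR}{P + R} = \frac{2 \times 83.6 \times 70.8}{83.6 + 70.8} \approx 76.7
```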

 

Obtaining and Citing our NER System

The system is part of the Farasa Arabic processing toolkit and is available under a research license from http://qatsdemo.cloudapp.net/farasa/ . It is written entirely in Java and can be invoked as a stand-alone executable or through an API. Usage examples are available at http://qatsdemo.cloudapp.net/farasa/usage.html .

For a detailed description of the system, please refer to the following two papers:

Kareem Darwish. 2013. Named Entity Recognition using Cross-lingual Resources: Arabic as an Example. ACL-2013.

Kareem Darwish, Wei Gao. 2014. Simple Effective Microblog Named Entity Recognition: Arabic as an Example. LREC-2014.