Managing the aftermath of Hurricane Irma: Machine Learning to the Rescue

Dear MicroMappers,

Please see below for the UAV imagery damage assessment produced by the Qatar Computing Research Institute (QCRI)'s Nazr-CNN algorithms. Our summary report is here. Also, see the previous blog post.

 

We are still improving the Nazr-CNN algorithms. Please stay tuned!

 

Thank you,

MicroMappers Team

Managing the aftermath of Hurricane Harvey: Machine Learning to the Rescue

Dear MicroMappers,

Hurricane Harvey was the first major hurricane to make landfall in the United States since Wilma in 2005, ending a record 12-year period with no major hurricanes making landfall in the country. In a four-day period, many areas received more than 40 inches (1,000 mm) of rain as the system meandered over eastern Texas and adjacent waters, causing catastrophic flooding. The resulting floods inundated hundreds of thousands of homes, displaced more than 30,000 people, and prompted more than 17,000 rescues.

Qatar Computing Research Institute (QCRI) has been working on machine learning for UAV (unmanned aerial vehicle, i.e., drone) imagery using deep learning algorithms. For Hurricane Harvey, we were able to run our algorithm on the imagery. Please see the results below. Our report explaining how this works is here.

 

We are still looking for more data to improve our model so that we can share it with the community.

With gratitude,
MicroMappers Team

 

Fine Grained Classification of UAV Imagery for Damage Assessment

Dear MicroMappers,

We are happy to announce that our paper "Fine Grained Classification of UAV Imagery for Damage Assessment" will be presented at DSAA2017, the 4th IEEE International Conference on Data Science and Advanced Analytics.

We really appreciate your help with this research project, especially from the Standby Task Force volunteers and Digital Jedis. This paper is a special milestone for the UAV research community because it is the first UAV imagery paper to use deep learning algorithms. We are working hard on the next research paper, based on the Philippines expedition dataset.

Stay tuned!

Thank you,

MicroMappers Team

 

MicroMappers Philippines Damage Assessment Deployment in UAV Images

Update:  Completed the activation.

Dear MicroMappers,

Thank you for your help! We have completed the activation. This dataset will be used to enhance the computer vision model. Please see below for the Digital Jedis' final activity map.

[Screenshot: Digital Jedis' final activity map]

 

Update: Our volunteers' activity map

[Map: volunteers' activity]

Dear MicroMappers,

We need your help!

We are now launching the MicroMappers damage assessment expedition for the Philippines! The purpose of this deployment is to develop and deliver a cutting-edge machine learning algorithm that automatically detects and categorizes damaged infrastructure in UAV images taken in the aftermath of a natural disaster. Your help is essential to this new deep learning algorithm. The final algorithm will be released to the public for humanitarian purposes.

To start your digital humanitarian efforts, simply click on the link below:

Tutorial: https://sites.google.com/site/annotationtutorial/
Link: http://clickers.micromappers.org/project/kdd/

Thanks for your help in supporting these important disaster relief efforts!

Your MicroMappers Team

MicroMappers Hub

Dear MicroMappers,

Earlier this year, we introduced MicroMappers' new face, the MicroMappers Hub. Since its release, we have been rolling out new features steadily. The current key features of MicroMappers are:

  • Twitter streaming and historical search data crawling. Unlike other platforms, you can collect historical data as well as current and incoming data based on your keywords.
  • Facebook search over public groups and public pages.
  • GDELT world news downloads associated with any crisis. You can download 3W and crisis-related image data, refreshed every 15 minutes, so the dataset is nearly real time.
  • A tracker for current hot-issue keywords.
  • Sentiment analysis on your Twitter data collection (machine learning).
  • Disambiguation analysis on your Twitter data collection (machine learning).
  • GDELT GEO 2.0 API integration. Based on your keywords, the system pulls geo-tagged data from GDELT GEO 2.0. The GEO 2.0 API is built on machine learning for NLP, computer vision, and machine translation, and is backed by big-data search; a minimal query sketch follows this list.
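
If you want to experiment with the same data source outside the Hub, here is a minimal sketch of querying the public GDELT GEO 2.0 API for geo-tagged coverage of a keyword. The endpoint and the query/format parameters follow GDELT's public documentation as we understand it; the keyword and the property fields printed at the end are just examples, so inspect the actual response for the fields you need.

```python
# Minimal sketch: pull geo-tagged news coverage for a keyword from the public
# GDELT GEO 2.0 API (the same data source the Hub integrates).
import requests

GEO_API = "https://api.gdeltproject.org/api/v2/geo/geo"

def fetch_geo_coverage(keyword):
    """Return GeoJSON features for recent coverage matching `keyword`."""
    params = {"query": keyword, "format": "GeoJSON"}
    resp = requests.get(GEO_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("features", [])

if __name__ == "__main__":
    for feature in fetch_geo_coverage("hurricane flooding")[:5]:
        props = feature.get("properties", {})
        # Property names depend on the API response; inspect `props` for others.
        print(props.get("name"), props.get("count"))
```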

Please see below for the GDELT Image Classifier features in MicroMappers.

Once you log in, find the "Disaster Global Events" box, then click "Image Classifiers".

[Screenshot: Disaster Global Events box with the Image Classifiers link]

Now you can see the list of current image classifiers. If you want to define your own, click "Request New Image Classifier"; this redirects you to the configuration page. If you click "View Map", you can see the map with images, as shown below.

[Screenshots: image classifier list and map view with images]

This is the Image Classifier configuration page. You simply need to fill out the form.

[Screenshot: Image Classifier configuration form]

As you can see, there are some powerful features here. If you are not sure how to start, please check the "Tutorial" first.

We want to hear about your experience and needs. Please visit the MicroMappers Hub and give us feedback.

Thank you,

MicroMappers Team

 

QCRI Named Entity Recognition in Tweets

Named Entity Recognition (NER) involves identifying named entities such as persons, locations, and organizations in text. NER is essential for a variety of Natural Language Processing (NLP), Information Retrieval (IR), and Social Computing (SC) applications. In this blog post, I present QCRI's state-of-the-art NER system for Arabic microblogs.

Microblog NER Challenges

NER on microblogs faces many challenges such as:

(1) Microblogs are often characterized by informality of language, the ubiquity of spelling mistakes, and the presence of Twitter name mentions (ex. @someone), hashtags, and URLs;

(2) NEs are often abbreviated. For example, tweeps (tweet authors) may write “Real Madrid” as just “the Real”;

(3) Tweeps often use brief and choppy expressions and incomplete sentences;

(4) Word senses in tweets may differ from word senses in news. For example, “mary jane” in tweets likely refers to marijuana rather than a person’s name;

(5) Tweeps may use capitalization inconsistently (in English), where words that should be capitalized may not be and ALL-CAPS words are used for emphasis; and

(6) We observed that NEs often appear at the beginning or end of tweets, and they are often abbreviated.

As for Arabic microblogs, they exhibit more complications, namely:

  • Tweets may contain transliterated words (ex. “LOL”→ لول) and non-Arabic words, particularly hashtags (ex. #Syria)
  • Arabic lacks capitalization
  • Most named entities observed in tweets are unlikely to have been seen during the training of an NER system.
  • Tweets frequently use dialects, which may lack spelling standards (ex. معرفتش and ماعرفتش are varying spellings of “I did not know”), introduce a variety of new words (ex. محد means “no one”), or make different lexical choices for concepts (ex. كويس and باهي both mean “good”).

Dialects also introduce morphological variations with different prefixes and suffixes. For example, Egyptian and Levantine dialects tend to insert the letter ب (sounds like “ba”) before verbs in the present tense.

QCRI NER

Most work on NER uses a sequence labeler, such as a Conditional Random Fields (CRF) labeler, that relies on a variety of contextual features and on gazetteers, which are large lists of named entities. Our state-of-the-art NER system builds on the same approach by presenting novel ways of building larger gazetteers, applying domain adaptation, using semi-supervised training, performing transliteration mining, and employing cross-lingual English-Arabic resources such as Wikipedia. We train a CRF sequence labeler with these enhancements.
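
To make the pipeline concrete, here is a minimal sketch of a CRF sequence labeler with contextual and gazetteer features. It uses the open-source sklearn-crfsuite package rather than our internal tooling, and the gazetteer entries, training tweet, and labels are tiny placeholders, not our data.

```python
# Minimal CRF NER sketch: contextual features plus a gazetteer lookup,
# trained with the open-source sklearn-crfsuite package.
import sklearn_crfsuite

LOC_GAZETTEER = {"دمشق", "حلب", "القاهرة"}  # placeholder gazetteer entries

def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "in_loc_gazetteer": word in LOC_GAZETTEER,
        "position": "start" if i == 0 else ("end" if i == len(tokens) - 1 else "mid"),
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
    return feats

def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]

# Toy training data: one tokenized tweet with BIO labels (placeholders).
train_sents = [["سافرت", "إلى", "دمشق"]]
train_labels = [["O", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(featurize(train_sents), train_labels)
print(crf.predict(featurize([["زرت", "حلب"]])))
```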

Using Arabic Wikipedia

Since larger gazetteers can positively impact NER, we used Wikipedia to build them. To do so, we used category-name keywords to filter Wikipedia titles down to those that constitute names of persons, locations, and organizations. Here are sample words (translated into English) that we used for filtering:

  • For persons: births, deaths, and living people.
  • For locations: countries, capitals, provinces, states, cities, airports, etc.
  • For organizations: organizations, companies, foundations, institutes, unions, etc.

We also used page redirects (alternative page names) to expand the gazetteers. The resulting gazetteers contained 70,908 locations, 26,391 organizations, and 81,880 persons.
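
Below is an illustrative sketch of that filtering step, assuming you have already extracted a title-to-categories mapping and a redirect table from a Wikipedia dump. The keyword lists mirror the samples above (in English, whereas the real filtering operated on Arabic category names); none of this is our exact pipeline.

```python
# Illustrative sketch: build gazetteers by filtering Wikipedia titles on
# category keywords, then expand them with page redirects.
PERSON_KEYS = ("births", "deaths", "living people")
LOCATION_KEYS = ("countries", "capitals", "provinces", "states", "cities", "airports")
ORG_KEYS = ("organizations", "companies", "foundations", "institutes", "unions")

def build_gazetteers(title_categories, redirects):
    """title_categories: {title: [category names]}; redirects: {alias: title}."""
    gaz = {"PERS": set(), "LOC": set(), "ORG": set()}
    for title, cats in title_categories.items():
        text = " ".join(cats).lower()
        if any(k in text for k in PERSON_KEYS):
            gaz["PERS"].add(title)
        if any(k in text for k in LOCATION_KEYS):
            gaz["LOC"].add(title)
        if any(k in text for k in ORG_KEYS):
            gaz["ORG"].add(title)
    # Expand with page redirects (alternative names for the same page).
    for alias, target in redirects.items():
        for names in gaz.values():
            if target in names:
                names.add(alias)
    return gaz
```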

English DBpedia:

DBpedia is a large, collaboratively built knowledge base of structured information extracted from Wikipedia; it contains 6,157,591 Wikipedia titles belonging to 296 types. Types vary in granularity, with each Wikipedia title having one or more types. For example, NASA is assigned the types Agent, Organization, and Government Agency. In all, DBpedia includes the names of 764k persons, 573k locations, and 192k organizations. Of the Arabic Wikipedia titles, 254,145 have cross-lingual links to English Wikipedia, and of those English titles, 185,531 have entries in DBpedia. We used the DBpedia types as features for the NER system.
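
As a hedged illustration of that lookup chain, the sketch below maps an Arabic Wikipedia title to its DBpedia types via the cross-lingual link and exposes those types as token features; the two dictionaries are placeholders standing in for the dump-derived link and type tables.

```python
# Illustrative sketch: Arabic title -> English title (cross-lingual link)
# -> DBpedia types, exposed as features for the labeler.
AR_TO_EN_TITLE = {"ناسا": "NASA"}                               # placeholder link table
EN_TITLE_TYPES = {"NASA": ["Agent", "Organization", "GovernmentAgency"]}  # placeholder types

def dbpedia_type_features(arabic_phrase):
    en_title = AR_TO_EN_TITLE.get(arabic_phrase)
    types = EN_TITLE_TYPES.get(en_title, []) if en_title else []
    return {f"dbpedia_type={t}": True for t in types}

print(dbpedia_type_features("ناسا"))
```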

Cross-Lingual Capitalization:

As I mentioned earlier, Arabic lacks capitalization, and Arabic names are often common Arabic words. For example, the Arabic name “Hasan” means “good.” To capture cross-lingual capitalization, we used a machine translation phrase table that was built from large amounts of parallel Arabic-English text and in which case was not folded on the English side. Then, given an Arabic word, we look up its English translations and observe the likelihood that they are capitalized.
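
A minimal sketch of that lookup follows, with a two-entry toy phrase table (the real one is huge and learned from parallel text): each row is (Arabic word, English translation, translation probability), and capitalized translations are weighted by their probability.

```python
# Illustrative sketch: estimate how often an Arabic word's English
# translations are capitalized, using a case-preserving phrase table.
phrase_table = [
    ("حسن", "good", 0.7),   # toy entries; "Hasan" also means "good"
    ("حسن", "Hasan", 0.3),
]

def capitalization_likelihood(arabic_word, table):
    total, capitalized = 0.0, 0.0
    for ar, en, prob in table:
        if ar == arabic_word:
            total += prob
            if en[:1].isupper():
                capitalized += prob
    return capitalized / total if total else 0.0

print(round(capitalization_likelihood("حسن", phrase_table), 2))  # 0.3
```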

Cross-Lingual Transliteration:

Many named entities, particularly persons and locations, are often transliterated. We look up the translations of Arabic words in the aforementioned phrase table and then use an in-house transliteration miner to determine whether the Arabic word and its English translation are also transliterations of each other. If they are, we use the transliteration probability as a feature.
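
As a rough sketch of the feature extraction only: `looks_like_transliteration` below is a stand-in heuristic where the in-house transliteration miner would go, and the tiny phrase table rows are placeholders.

```python
# Illustrative sketch: derive a transliteration-probability feature from
# phrase-table translations of an Arabic word.
def looks_like_transliteration(arabic, english):
    # Placeholder for the in-house transliteration miner, which learns
    # character mappings from data and returns a probability.
    return 0.9 if english[:1].isupper() else 0.0

def transliteration_feature(arabic_word, phrase_table):
    """phrase_table rows are (arabic, english, translation_probability)."""
    best = 0.0
    for ar, en, prob in phrase_table:
        if ar == arabic_word:
            best = max(best, prob * looks_like_transliteration(ar, en))
    return {"translit_prob": best} if best else {}

print(transliteration_feature("حسن", [("حسن", "Hasan", 0.3), ("حسن", "good", 0.7)]))
```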

Using Domain Adaptation:

In addition to training on microblog text tagged with named entities, we mixed tagged news text with tagged microblog text to make use of the much larger news training data.

Semi-Supervised Training:

We used our best NER system to tag a large corpus of microblogs. Our intuition was that if we automatically tag a large set of tweets, an NE may be tagged correctly multiple times. The automatically identified NEs can then be used as a “new gazetteer.”
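
A minimal sketch of that idea is below, where `tag` stands in for the already-trained tagger and `min_count` is an illustrative threshold rather than the actual selection criterion we used.

```python
# Illustrative sketch of the "new gazetteer" step: tag a large tweet corpus
# with the existing model and keep entities tagged consistently often enough.
from collections import Counter

def mine_gazetteer(tweets, tag, min_count=3):
    """tag(tokens) -> list of (entity_text, label) pairs; tweets are token lists."""
    counts = Counter()
    for tokens in tweets:
        for entity, label in tag(tokens):
            counts[(entity, label)] += 1
    return {pair for pair, count in counts.items() if count >= min_count}
```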

How Good Is Our NER System?

The QCRI NER system is considered state-of-the-art for Arabic microblogs.  Table 1 reports on the evaluation results for the NER system.  We performed the evaluation on a set of 1,423 tweets containing nearly 26k tokens. The tweets were randomly selected from the period of Nov. 23-27, 2011.

Table 1. NER results (all values in %)

Type      Precision   Recall   F-measure
LOC       83.6        70.8     76.7
ORG       76.4        43.7     55.6
PERS      67.1        47.8     55.8
Overall   76.8        56.6     65.2

 

Obtaining and Citing our NER System

The system is part of the Farasa Arabic processing toolkit and is available under a research license from http://qatsdemo.cloudapp.net/farasa/. It is written entirely in Java and can be invoked as a stand-alone executable or through an API. Usage examples are available at http://qatsdemo.cloudapp.net/farasa/usage.html.

For a detailed description of the system, please refer to the following two papers:

Kareem Darwish. 2013. Named Entity Recognition using Cross-lingual Resources: Arabic as an Example. ACL-2013.

Kareem Darwish, Wei Gao. 2014. Simple Effective Microblog Named Entity Recognition: Arabic as an Example. LREC-2014.