We received a request from Humanitarian OpenStreetMap (HOT) shortly after Superstorm Sandy hit the northeastern US in October/November 2012. They were interested in deploying the “expert” crowd (GISCorps volunteers) on a crowdsourcing project that they launched shortly after the storm. They asked that the GISCorps volunteers help evaluate the crowd’s overall accuracy by rating a sample of the site’s images using the same interface. Volunteers came from seven countries: Amelia Ley (US), Naiara Fernandez (Spain), Roxroy Bollers (Guyana), Giedrius Kaveckis (Italy), Jeffrey Pires (US), David Anderson (US), Meliv Purzuelo (Philippines), Kevin Pomaski (US), and Eyob Teshome (Ethiopia). The following report describes the details of the project.
Aerial Damage Assessment Following Hurricane Sandy
Jennifer Chan, Harvard Humanitarian Initiative & Northwestern University
John Crowley, Harvard Humanitarian Initiative and Humanitarian OpenStreetMap
Shoreh Elhami, GISCorps Founder
Schuyler Erle, Idibon and Humanitarian OpenStreetMap
Robert Munro, Idibon
Tyler Schnoebelen, Idibon
This document is an after-action report for information processing following Hurricane Sandy, one of the largest disasters to hit urban areas in the past 12 months. We analyze the aerial damage assessment process in a number of ways, reporting the results and suggesting methods to ensure quality and reliability for similar responses to future events.
Following Hurricane Sandy’s landfall on the Eastern seaboard of the USA in 2012, the Civil Air Patrol (CAP) took over 35,000 GPS-tagged images of damage-affected areas. This was performed as part of their mandate to provide aerial photographs for disaster assessment and response agencies, primarily FEMA, who used the aggregate geolocated data for situational awareness.
The scale of the destruction meant that there was a relatively large number of photographs for a single disaster. As a result, it was the first time that CAP and FEMA used distributed third-party information processing for the damage assessment, with 6,717 public, non-expert volunteers evaluating the level of damage present in the images via an online crowdsourcing system. The contributors viewed one image at a time and gave a three-way judgment: little/no damage; medium damage; or heavy damage. This report assesses the quality of the damage assessment by evaluating the volunteer workers’ performance in three ways:
1. Inter-annotator agreement: how often did volunteers agree with each other?
2. Comparison with experts: 11 expert raters from the GISCorps assessed a selection of the images as part of this report (also as volunteers).
3. Ground-truthed ratings: comparison to ratings made by FEMA at the same grid locations.
Additionally, this report evaluates the GISCorps’ volunteer experience to understand motivating factors for skilled volunteer engagement and to learn how to improve crowdsourcing platforms and process for future disaster deployments.
Deployment and Author Involvement
The volunteers used the MapMill software (Warren, 2010), released by the Public Laboratory for Open Technology and Science (PLOTS) and adapted for this task by Humanitarian OpenStreetMap. It was deployed and run by Schuyler Erle (author). The platform was developed at Camp Roberts RELIEF, organized by John Crowley (author), in collaboration with the Civil Air Patrol, FEMA and professionals including Robert Munro (author). Erle’s involvement and subsequent analysis were supported by Idibon staff including Tyler Schnoebelen (author), and the GISCorps volunteers were managed by Shoreh Elhami (author). Jennifer Chan (author) supported both the deployment and analysis.
Agreement among non-experts
Inter-annotator agreement is a common metric for evaluating accuracy in crowdsourced tasks when the “correct” answer is not known. If there is a large amount of agreement about a judgment from multiple crowdsourced workers, then chances are that the shared judgment is the correct one. The public, non-expert volunteers rated a total of 35,535 images, which received, on average, 4.51 ratings each. We restrict the analysis to the 17,070 images that had 3 or more ratings.
As Figure 1 demonstrates, there is a high level of agreement. The public volunteers had majority agreement on 15,968 images (93.54%). And even if we restrict ourselves to a “super-majority” definition of agreement, agreement was still at 80%. That said, there was unanimous agreement on less than 50% of the images, showing that complete agreement was relatively rare.
Figure 1: Three different levels of quality assessment on 17,070 images (limited to images with three or more ratings per image), showing that non-expert volunteers generally agreed with each other on how to classify an image.
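The three agreement levels can be illustrated with a minimal sketch. It is not the code used for the report. It assumes ratings are stored as a list of labels per image, that the labels are 0, 5 and 10 for the three damage levels, and that "super-majority" means at least two-thirds of the ratings on an image; the report does not state the threshold it used. The function names and the toy data are hypothetical.

```python
from collections import Counter


def agreement_level(ratings):
    """Classify one image's ratings by how strongly the raters agree."""
    n = len(ratings)
    top = Counter(ratings).most_common(1)[0][1]  # votes for the most common label
    if top == n:
        return "unanimous"
    if 3 * top >= 2 * n:  # assumed super-majority cutoff: at least two-thirds
        return "supermajority"
    if 2 * top > n:  # simple majority: strictly more than half
        return "majority"
    return "none"


def summarize(ratings_by_image, min_ratings=3):
    """Return cumulative agreement shares over images with enough ratings."""
    eligible = [r for r in ratings_by_image.values() if len(r) >= min_ratings]
    if not eligible:
        raise ValueError("no image has enough ratings")
    levels = Counter(agreement_level(r) for r in eligible)
    unanimous = levels["unanimous"]
    supermajority = unanimous + levels["supermajority"]
    majority = supermajority + levels["majority"]
    n = len(eligible)
    return {
        "images": n,
        "majority": majority / n,
        "supermajority": supermajority / n,
        "unanimous": unanimous / n,
    }


# Toy data (hypothetical): image id -> ratings from different volunteers.
toy = {
    "a": [0, 0, 0],
    "b": [0, 0, 5],
    "c": [0, 5, 10],
    "d": [5, 5, 5, 10],
    "e": [0, 0],  # dropped: fewer than 3 ratings
    "f": [0, 0, 0, 5, 5],
}
summary = summarize(toy)
```

The shares are cumulative, so a unanimous image also counts toward the super-majority and majority figures, which matches how the three levels are compared in the text. The reported majority figure is the same kind of ratio: 15,968 of 17,070 images is 93.54%.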
The fact that there is a consensus for most images is encouraging. But how high is the quality of the actual raters? For the evaluation, we defined someone as a Good Rater if their own ratings correspond to majority opinions. There are 6,717 public volunteers, who rated an average of 23.86 images each (median of 5). There are 34,433 images that have majority agreement. How do the non-expert volunteers do on these?
There are 4,370 non-expert volunteers that we have enough data to evaluate (i.e., they have 3+ ratings for images where there are majority verdicts). It turns out that most of the volunteers agree with each other. The chart below shows how, for example, 3,652 of the 4,370 users agree with the majority verdict for the majority of images that they rate (83.57%).
Figure 2: Focusing on raters, we see that most non-expert raters are consistent with the majority opinions.
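A rough sketch of the Good Rater check follows, under stated assumptions rather than as the report's actual procedure. It assumes judgments are available as (rater, image, rating) tuples, that a majority verdict means more than half of an image's ratings, and that a rater is "good" when more than half of their evaluable ratings match the verdict. The rater's own vote is included when the verdict is computed, which the report does not specify either way. All names and data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_verdict(ratings):
    """Return the label chosen by more than half of the raters, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if 2 * count > len(ratings) else None


def good_raters(judgments, min_rated=3):
    """Map each evaluable rater to True if they mostly match majority verdicts.

    judgments: iterable of (rater_id, image_id, rating) tuples (assumed layout).
    Raters with fewer than min_rated ratings on verdict images are left out.
    """
    judgments = list(judgments)
    by_image = defaultdict(list)
    for _, image, rating in judgments:
        by_image[image].append(rating)
    verdicts = {image: majority_verdict(r) for image, r in by_image.items()}

    hits, totals = Counter(), Counter()
    for rater, image, rating in judgments:
        verdict = verdicts[image]
        if verdict is None:  # no majority on this image, so nothing to compare
            continue
        totals[rater] += 1
        hits[rater] += rating == verdict  # True counts as 1
    return {
        rater: 2 * hits[rater] > totals[rater]
        for rater in totals
        if totals[rater] >= min_rated
    }


# Toy data (hypothetical): rater D has too few ratings to be evaluated.
toy = [
    ("A", 1, 0), ("B", 1, 0), ("C", 1, 5), ("D", 1, 0),
    ("A", 2, 5), ("B", 2, 5), ("C", 2, 0),
    ("A", 3, 10), ("B", 3, 10), ("C", 3, 0),
]
result = good_raters(toy)
```

On the toy data, A and B match every majority verdict, C matches none, and D is excluded for having only one rating. The reported share is computed the same way: 3,652 of 4,370 evaluable raters is 83.57%.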
For the comparison with experts, 720 of the most problematic images were assessed by 11 GISCorps experts, using the same platform and instruction set. We defined the “tough” images as those with the least agreement between the annotators. The average number of experts/image was 3.18 (median of 3). (The average number of public volunteer judgments for these tough images was 15.6, median of 14).
How much agreement did experts have on these tough cases? In Figure 3, we show that experts generally agree about how to rate tough images unless we hold them to an unrealistic expectation of perfect agreement. 81% of the images had a supermajority agreement among the experts, compared to just 37% for public volunteers, showing that the volunteers were not as accurate (in terms of inter-annotator agreement) for these images.
Figure 4: Raters generally gave out the same kinds of ratings.
The main area of disagreement between the groups was for images that the volunteers said showed no real damage and which the experts said showed some damage (9% of the images). As Table 1 shows, only 11% of the ratings were dramatically off (where one group said there was no damage and the other group said there was severe damage). 63% of the toughest images were agreed upon between experts and non-experts.
Table 1: Rating distributions for the 662 tough images that have majority votes among both groups (i.e., there are 30 images that a majority of experts call “0” but which a majority of public volunteers call “5”).
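A table of this kind can be built by cross-tabulating the two groups' majority verdicts. The sketch below is illustrative only. It assumes each group's ratings are stored per image, that the labels are 0, 5 and 10, and that images without a majority in either group are dropped, as the caption describes. The function names and the toy data are hypothetical and do not reproduce the report's counts.

```python
from collections import Counter


def majority_verdict(ratings):
    """Return the label chosen by more than half of the raters, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if 2 * count > len(ratings) else None


def cross_tabulate(expert_ratings, public_ratings):
    """Count images by (expert verdict, public verdict) where both exist."""
    table = Counter()
    for image in expert_ratings.keys() & public_ratings.keys():
        expert = majority_verdict(expert_ratings[image])
        public = majority_verdict(public_ratings[image])
        if expert is not None and public is not None:
            table[(expert, public)] += 1
    return table


def agreement_rate(table):
    """Share of cross-tabulated images where both groups gave the same verdict."""
    total = sum(table.values())
    same = sum(n for (expert, public), n in table.items() if expert == public)
    return same / total


# Toy data (hypothetical): image id -> ratings, one dict per group.
experts = {"i1": [0, 0, 5], "i2": [5, 5, 5], "i3": [10, 10, 0], "i4": [0, 5, 10]}
public = {"i1": [5, 5, 0, 5], "i2": [5, 5, 0], "i3": [10, 10, 10, 0], "i4": [0, 0, 0]}
table = cross_tabulate(experts, public)
```

Here `table[(0, 5)]` counts images that a majority of experts call "0" and a majority of public volunteers call "5". Image `i4` is dropped because the experts have no majority on it.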
The truth on the ground
The third evaluation produced a negative result, as we were not able to find a strong correlation between the aerial evaluations and FEMA’s ground reports. We can identify some grids where this is due to timing: the presence of flood water was typically marked as high damage, but it had receded before the FEMA assessments (the initial aerial assessments were completed in the first days for immediate resource allocation, insurance assessments, etc., while the ground reports were a more detailed and pinpointed exercise). In other places, there was a mismatch between aerial photographs and grid references. For example, while CAP ratings applied to a large area, only a small subsection of that would need to be affected for FEMA to call it damaged.
In general, the largest agreement is with mutual 0’s, where there is essentially no damage. This is also true when we compare the “everyone agrees” ratings with FEMA’s on-the-ground classifications. The next table also demonstrates that the highest damage ratings from the aerial photographs are only rated as “affected” by people on the ground.
Table 2: Images per FEMA category; ratings are those that both experts and public volunteers agree upon.
Improving the platform
- GIS volunteer skills & background, including prior experience in a disaster context
- User experience with the platform
- Communication & feedback
- Volunteer engagement & sustainability
- Volunteer recommendations
Additional coding was again performed to identify themes that emerged across interview questions as well as issues and topics that emerged across individual interviews.
GIS volunteers skills and backgrounds
User experiences with the platform
Image types and quality
Imagery magnifying glass
Communication and feedback
Volunteer engagement and sustainability
- Hire non-expert microtaskers in order to process the majority of data, as discussed in the previous section.
- Plan on 4-6 non-expert judgments per image.
- Route difficult images to expert annotators (or a larger number of non-experts).
Figure 6: Recommended process
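One way to read the recommended process is as a routing rule applied to each image as judgments arrive. The sketch below is a hypothetical illustration, not the deployed workflow. The report recommends 4-6 non-expert judgments and routing difficult images to experts, but it does not define "difficult"; the sketch assumes an image is settled once at least two-thirds of its ratings agree. The function name and return labels are invented for the example.

```python
from collections import Counter


def route_image(ratings, min_judgments=4, max_judgments=6):
    """Decide the next step for one image given its non-expert ratings so far."""
    n = len(ratings)
    if n < min_judgments:
        return "collect_more"
    top = Counter(ratings).most_common(1)[0][1]  # votes for the most common label
    if 3 * top >= 2 * n:  # assumed cutoff: at least two-thirds agree
        return "accept"
    if n < max_judgments:
        return "collect_more"
    return "send_to_expert"  # still contested after the maximum number of judgments


# Example calls with ratings on an assumed 0/5/10 scale.
early = route_image([0, 0, 0])
settled = route_image([0, 0, 0, 5])
contested = route_image([0, 5, 0, 5])
escalated = route_image([0, 5, 0, 5, 10, 10])
```

Stopping as soon as an image is settled keeps most images at four judgments, so the extra non-expert and expert effort is spent only on the contested ones.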
- Add instructions on how to interpret images and assign categories.
- Include options to decline image categorization (e.g., for dark or blurry images).
- Consider showing images within their geographical region and/or clustering images from the same region to be considered together.
- Consider showing pre- and post-disaster images for comparison purposes.
- Consider an initial step to filter out blank/black/otherwise uninterpretable images prior to the damage assessment task.
- Consider investing in the capacity to provide project coordinators who volunteer shifts to provide online support, feedback and other communications to assessors during a deployment.
- Determine better ways to map judgments of aerial data to the on-the-ground assessments that FEMA performs.
Future research directions
- Civil Air Patrol (CAP) assessment analysis: This study would evaluate the experiential expertise of CAP volunteers and their expert assessments compared to the crowd and to GIS remote sensing experts.
- Paid workforce analysis: This study would investigate the potential added value of incorporating paid workforces into future deployments. It would include a comparative analysis of the inter-assessor agreement between paid workers, expert volunteers, and CAP volunteers. The study would also include a design simulation, where paid workforces would be positioned at different workflow stages along with the crowd and experts to determine the optimal use of this type of workforce and at what cost.
- Pre- and post-disaster imagery analysis: A feasibility and benefit study.
- Stage 1 – A feasibility study on acquiring pre-disaster imagery by CAP as a preparedness activity. The project would begin with selecting 3 US regions most at risk for future disasters. This pre-disaster or baseline imagery could be acquired from existing databases or, potentially, by investing in CAP fly-overs. The study would determine the capacity, investment and time needed to order and process imagery and to design a pre-post assessment platform.
- Stage 2 – A pilot comparison study that analyzes the degree of improvement in pre-post imagery assessments, both by experts and the crowd. This study would also include a cost/benefit analysis for accuracy gained compared to the investment needed to acquire pre/post data as well as design the platform for this specific use.
- Imagery cluster analysis: This study would investigate the potential added value of clustering images for serial and parallel assessment by volunteers. This includes modeling imagery sets by various cluster geographical sizes, and the degree of cluster overlap between volunteers to potentially validate or increase inter-assessor agreement. It would also include a comparative analysis between the crowd, experts and CAP volunteers.
- Combining information from other sources: This project would look for ways to combine aerial analysis with information from official and citizen sources. For example, it might employ Natural Language Processing over social media (Munro, 2012), add eye-level photographs from ground teams or affected populations, or potentially incorporate other types of crowdsourced information processing. This would allow responders to quickly link the damage assessments to ground-based reports at the same locations.
- Evaluate vehicle accessibility: Vehicular access is vital for disaster response (Dolinskaya et al. 2013), and the process used here could just as easily have focused on blocked or damaged roads.
- Simulate, learn, and iterate collaborative project: This cross-cutting collaborative project will interface with the above projects over years to integrate design, experimental and learning simulations to complement the above studies. Evaluation methods and designs will be employed to help facilitate learning from each project and translate this into future iterations for practical deployments as well as new areas of research and study.
Crowley, John, and Jennifer Chan. 2011. “DISASTER RELIEF 2.0: The Future of Information Sharing in Humanitarian Emergencies.” Harvard Humanitarian Initiative and UN Foundation-Vodafone Foundation-UNOCHA.
Dolinskaya, Irina, Karen Smilowitz, and Jennifer Chan. 2013. Integration of Real-Time Mapping Technology in Disaster Relief Distribution. Center for the Commercialization of Innovative Transportation Technology. Northwestern University.
Munro, Robert, Schuyler Erle, and Tyler Schnoebelen. 2013. Analysis After Action Report for the Crowdsourced Aerial Imagery Assessment Following Hurricane Sandy. 10th International Conference on Information Systems for Crisis Response and Management. Baden-Baden, Germany.
Munro, Robert. 2012. Crowdsourcing and Natural Language Processing for Humanitarian Response. Crisis Informatics and Analytics. Tulane.
Munro, Robert. 2013. Crowdsourcing and the Crisis-Affected Community: lessons learned and looking forward from Mission 4636. Journal of Information Retrieval 16(2). Springer.
Warren, Jeffrey Yoo. 2010. Grassroots mapping: tools for participatory and activist cartography. PhD dissertation. Massachusetts Institute of Technology.