USAID Crowdsourcing project (a 3-phased project)
By David Litke (GISCorps volunteer) with additions from the USAID case study with permission from USAID
Earlier this year, the United States Agency for International Development (USAID) requested the assistance of GISCorps volunteers for USAID’s first-ever crowdsourcing event to open and map data. USAID wanted to geo-code non-standard location information of loans made thanks to the support of USAID’s Development Credit Authority (DCA). USAID was interested in this project because proper geo-visualization can signal new areas for potential collaboration with host countries, researchers, other development organizations, and the public.
USAID identified a global USAID dataset of over 100,000 non-standardized records that needed to be geo-coded. Although thousands of these records had already been geo-located (by using various automated tools), there were still nearly 10,000 in need of geo-coding. To accomplish that, USAID pursued its first-ever crowdsourcing solution for the data cleanup. The project reflects the Agency’s efforts to increase transparency and access to these types of datasets.
Multiple partners participated in the project, including Standby Task Force (SBTF), GISCorps, Socrata, ESRI, and Data.gov. Initially, over 75 GISCorps volunteers from 17 countries responded to the call for the three phases of this project; in the end, the 58 deployed volunteers reported over 500 hours of service (some volunteers worked on multiple phases). The following report summarizes their efforts in each phase. All phases of this project were ably led by David Litke, a veteran GISCorps volunteer of several other missions. Other volunteers include:
Isabel Canete-Medina, Stephen Borders, Natasha Herne, Olushola Yakubu, Brian James Baldwin, Paul Giers, Assumita Chiremba, John Foster, Khalid Duri, Geraldine Eggermont, Manuel Fiol, Peddi Asok, Matt Weller, Thanh Le Minh, Penjani Hopkins Nyimbili, Alex Woldemichael, Ali Rehmat, Kari Buckvold, Chris Kleinhofs, Elisabeth Eveleigh, James Smith, Darryl Clare, Michael Aughenbaugh, Nadeem Fareed, Maujakakana Rutjani, Mary Meade, Jeff Fennell, Michelle Boivin, Maude Tholly, Augustine Boamah, Richard Mutambuli, Janet Vaughn, Chuck Gooley, Jamie Hughes, Adongo Clare, Arnold Martey, Adam Guo, Joy Straley, Joe Dickinson, Brenda Rahal, Amanuel Tesfay, Richard Monteiro, Kyung Kim, Renato Machado, Eric Grimison, Nahum-Obed Sanchez, Suchern Ong, Shibata Takeo, Rajat Rajbhandari, Braulio Medina, Muhammad Ameen, Sumit Sharma, Eyob Teshome, Bart Monne, Rachel Starner, Albert Decatur.
Phase 1: Crowdsourcing
Partners involved: SBTF, GISCorps, Socrata, Esri.
This phase involved various tools and means of communication, from Google Docs to Skype channels that acted like a ‘call center’. Several other tools were also made available to the volunteers for geo-coding, including Data.gov as the crowdsourcing platform, Esri’s custom Admin1 geo-coder, and a Socrata application that allowed users to check out up to ten individual records from the database at a time for processing. Records had to be geolocated at the first Administrative Division level (Admin1), the largest administrative divisions within a country. In all, 46 GISCorps volunteers signed up for the first phase; however, not every one of them was able to participate, mainly because the work was completed in only 16 hours.
Esri’s Geo-coding application
Phase 2: Data Processing and Mapping
Partners involved: DOD & GISCorps
12 GISCorps volunteers participated in this phase to geo-code 69 remaining records from phase one that were not completed due to bugs in the application.
Phase 3: Accuracy Assessment & Quality Control
Partners involved: GISCorps. A total of 17 volunteers participated in this phase.
Records were geolocated using two methods: an Automated method and a Crowdsourcing method. The purpose of Phase 3 of this project was to assess the accuracy of both methods.
The Admin1 Geolocation Process
Geographic information within a DCA record that can be used to geolocate the record consists of the Country Name and an OriginalLocation free-form text field, which can contain any combination of street address, city/town name, and Admin1 name. An initial scan of the DCA database found that roughly 30,000 records had no OriginalLocation information, which left approximately 87,000 records as candidates for Admin1 level geolocation. To accomplish a geolocation, the geographic information in the OriginalLocation field is compared against an authoritative database of valid Admin1 Names; if a match is found, then the record has been geolocated. If a match is not found, a search can also be made against an authoritative database of city names; if a match is found, then there is a good probability of determining an Admin1 name, but there is less certainty in this type of determination because there may be multiple cities of the same name within a country but within different Admin1 units. In this case more research may be needed to geolocate the record, such as using a street name to determine which exact city the record refers to.
There are multiple authoritative databases of geographic names (gazetteers) available. These databases contain administrative boundary names and city/town names as well as other named map features. Those used for this project include:
National Geospatial-Intelligence Agency (NGA) Geographic Names Database. The official US Government database of worldwide geographic features.
Geonames.org GeoNames Database. A non-profit open-source database of worldwide geographic names; it uses the NGA database as well as many other databases as its source material.
ESRI Databases. ESRI developed a geolocating application for this project which used ESRI worldwide databases.
Google Geocoding Database. A proprietary database maintained by Google but made available for searches by the public through Google’s Google Maps web page and through the Google Geocoding API.
Because a DCA record often contains an Admin1 name and a city/town name within the OriginalLocation field, it was recognized as feasible to develop an automated process that used a computer script to parse out the Admin1 name and/or city/town name and validate them against an authoritative database. The script first looked for matches for city/town and Admin1 names within the specified country against the NGA database. If no match was found, the text of OriginalLocation was input to the Google Geocoding API to see if it would return an Admin1 name that was valid in the NGA database. The Automated method geolocated a large majority of the records.
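The matching logic described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script actually used for the project: the tiny gazetteer, the country code "XX", the place names, the Admin1 codes, and the function name geolocate are all hypothetical, and the Google Geocoding API fallback is only indicated by a comment.

```python
import re

# Hypothetical miniature gazetteer for a made-up country "XX".
# A real run would load Admin1 and city names from NGA or GeoNames data.
ADMIN1_NAMES = {"XX": {"NORTHLAND": "XX01", "SOUTHLAND": "XX02"}}
CITY_TO_ADMIN1 = {
    "XX": {
        "RIVERTOWN": {"XX01"},
        "SPRINGFIELD": {"XX01", "XX02"},  # same city name in two Admin1 units
    }
}


def geolocate(country, original_location):
    """Return (admin1_code, how) for one record; admin1_code may be None."""
    # Normalize to upper-case words padded with spaces so that
    # multi-word names can be matched on whole-word boundaries.
    words = re.findall(r"\w+", original_location.upper())
    text = " " + " ".join(words) + " "

    # Step 1: look for a valid Admin1 name within the given country.
    admin1_hits = {
        code
        for name, code in ADMIN1_NAMES.get(country, {}).items()
        if f" {name} " in text
    }
    if len(admin1_hits) == 1:
        return admin1_hits.pop(), "admin1_name"
    if len(admin1_hits) > 1:
        return None, "ambiguous_admin1"

    # Step 2: fall back to city names, which may map to several Admin1 units.
    city_hits = set()
    for name, codes in CITY_TO_ADMIN1.get(country, {}).items():
        if f" {name} " in text:
            city_hits |= codes
    if len(city_hits) == 1:
        return city_hits.pop(), "city_name"
    if len(city_hits) > 1:
        return None, "ambiguous_city"  # needs further research

    # Step 3 (not implemented here): send the text to an external
    # geocoding service and validate the returned Admin1 name.
    return None, "no_match"


print(geolocate("XX", "12 Main St, Rivertown, Northland"))
print(geolocate("XX", "Springfield"))
```

Matching on whole words is deliberately naive: any word that happens to equal an Admin1 name produces a match, so a real script would need extra context checks before accepting a result.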
The Crowdsourcing method was initiated as a cost-efficient and participatory way of geolocating the records for which the Automated method had failed. The Crowdsourcing method (Phase 1) provided an online link to the dataset and a recommended workflow for volunteers to use; more details on the Crowdsourcing process can be found in the USAID report “Case Study—How USAID Utilized Crowd Sourcing to Open Data.” The recommended workflow suggested that volunteers use all four of the authoritative databases listed above, as well as local knowledge and any other resources that the volunteers had at hand.
Geoname Search Interface
Phase 3 Design
Volunteers in Phase 3 were tasked with creating a Quality Control dataset of high-quality geolocated records with which to do an accuracy assessment of the Automated and Crowdsourcing methods of geolocation. A random sample of records was drawn from both datasets; 382 records were drawn from the Automated database, and 322 records were drawn from the Crowdsourcing database. These sample sizes were chosen to ensure that sample estimates would correctly represent population metrics. The 17 Phase 3 participants were selected from among highly experienced GIS professionals in GISCorps; participants had an average of 8 years of GIS experience. In addition to professional experience, participants were chosen who had experience in this specific geolocating process: of the 17 Phase 3 participants, 13 had taken part in previous phases. In addition, participants were preferentially assigned records for countries in which they had personal experience, or whose language they spoke. Participants were instructed to geolocate records with the greatest possible care, since their results were to be considered the gold standard. Phase 3 participants used the same geolocating resources as were used for Phases 1 and 2. Participants were not exposed to the earlier Automated or Crowdsourced results for geolocated records, so as not to bias their determinations. Participants were asked to quantify the Difficulty and Certainty of their determinations on a 1 to 5 point scale. For example, a Difficulty rank of 1 indicated that a correctly spelled city/town name and Admin1 name were present in OriginalLocation, while a Difficulty rank of 5 indicated that neither a city/town name nor an Admin1 name was present and the location had to be inferred; and a Certainty rank of 5 indicated that the volunteer was ~100% sure of the Admin1 assignment, while a Certainty rank of 1 indicated that the assigned Admin1 name was a best guess.
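For readers who want to see how sample sizes of this kind are commonly derived, the sketch below applies Cochran's formula with a finite population correction. The report does not state the confidence level, margin of error, or population sizes that were used, so the defaults (95 percent confidence, 5 percent margin, worst-case proportion of 0.5) and the population figures in the example are assumptions for illustration only.

```python
import math


def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample size with finite population correction.

    population: number of records the sample is drawn from
    margin:     acceptable margin of error (assumed 5 percent)
    z:          z-score for the confidence level (1.96 for about 95 percent)
    p:          expected proportion (0.5 is the most conservative choice)
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)         # finite population correction
    return math.ceil(n)


# Hypothetical population sizes, not figures from the report.
for size in (1_000, 10_000, 100_000):
    print(size, sample_size(size))
```

The required sample grows only slowly with the population, which is why a few hundred records can be enough to characterize a dataset of many thousands.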
Phase 3 Results
Accuracy of results was calculated by comparing the QC dataset Admin1 Codes with the previously determined Admin1 Codes; the Codes were used rather than the Admin1 Names, because there is some variation in the spelling of Admin1 Names among the three geolocation resources. The Automated method was found to be 64 percent accurate, while the Crowdsourcing method was found to be 85 percent accurate.
Automated Method. Of the 382 records in the QC dataset for the Automated method, 244 were in agreement with the Automated method results (accuracy of 64 percent), and 138 were in disagreement. The median Certainty rating of records in the QC database (the degree to which volunteers were sure of their assignments) was 5, the highest rating of Certainty, so it is highly certain that the Automated method results were inaccurate for these records. The median Difficulty ranking of records in the QC dataset was 2, which indicates that the OriginalLocation field contained a valid Admin1 name or City/Town name, but that these valid values may have been difficult to parse out from among a long string of data.
These results suggest that the Automated method script might be re-evaluated and improved by examination of the 138 records where invalid assignments were made. Many of the invalid assignment records contained a complex series of words in OriginalLocation, for example:
“PERUM LESATARI, DESAO LAMREUNG, DARUL IMARAH BESAR”
A quite sophisticated logic might be needed to find the correct keywords for deciphering this location. In other cases the OriginalLocation was not as complex, but the Automated method was too simplistic in its evaluation; for example for the OriginalLocation of:
“DE JUNIO 2 Y CALDER ACTIGUA BAHIA”, the Automated method recognized the word “Bahia” as a valid Admin1 Name, while the expert discovered that Antigua Bahia is the name of a neighborhood in the city of Guayaquil in the Canton Guayaquil Admin1 unit.
Crowdsourcing Method. Of the 322 records in the QC dataset for the Crowdsourcing method, 259 were in agreement with the Crowdsourcing results (accuracy of 80 percent), and 63 were in disagreement. However, among the 63 records in disagreement, 15 mismatches were due simply to transcription errors, for example, where an Admin1 Code of “11” was typed instead of the correct code of “TZ11” (the country code was omitted). These errors are quite easy to fix by visual inspection of the database. After correction of these errors, the accuracy rate of the Crowdsourcing method improved from 80 percent to 85 percent.
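The accuracy figures above follow from simple counts, and the transcription fix can be expressed as a small normalization step. The sketch below is illustrative only: the function names are hypothetical, and apart from “TZ11” the sample codes are invented.

```python
def accuracy(qc_codes, method_codes):
    """Share of records whose Admin1 code equals the QC (gold standard) code."""
    matches = sum(q == m for q, m in zip(qc_codes, method_codes))
    return matches / len(qc_codes)


def repair_missing_country(code, country):
    """Prefix the country code when only the numeric part was typed."""
    code = code.strip().upper()
    return country + code if code.isdigit() else code


# Reported counts from the QC comparison.
print(round(100 * 244 / 382))         # Automated method
print(round(100 * 259 / 322))         # Crowdsourcing, before correction
print(round(100 * (259 + 15) / 322))  # Crowdsourcing, after fixing 15 records

# Toy example with invented codes: one record is missing its country prefix.
qc = ["TZ11", "TZ02", "TZ05"]
crowd = ["11", "TZ02", "TZ07"]
fixed = [repair_missing_country(code, "TZ") for code in crowd]
print(accuracy(qc, crowd), accuracy(qc, fixed))
```

Repairing only codes that consist entirely of digits keeps the fix conservative: a code that already carries a country prefix is left untouched even if it disagrees with the QC value.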
The median Certainty rating of records in the QC database (the degree to which volunteers were sure of their assignments) was 4, the second-highest rating of Certainty, so the experts were only slightly less certain of their designations than they were for the Automated method dataset. A lesser certainty is to be expected, since these records were more complex to evaluate (as suggested by the fact that the Automated method was unable to find matches). Surprisingly, however, the median Difficulty ranking of records in the QC dataset was 2, the same as for the Automated records, which indicates that even for the Crowdsourcing Method records, the OriginalLocation field contained a valid Admin1 name or City/Town name.
Among the 63 records in disagreement, 19 were geolocated by the crowd while the QC database found the records to be non-locatable. This suggests that the Phase 1 participants, in their zeal for success or inexperience, might have produced a result where one was not warranted. This may indicate that for best results, expert volunteers are needed.
The high accuracy rate for the Crowdsourcing method is a promising indicator of the quality of Crowdsourced data, especially when experienced professional volunteers are recruited. The lower accuracy rate for the Automated method suggests that more sophisticated algorithms need to be developed to impart a higher degree of intelligence to the computer – one way to develop this machine intelligence is through a QC check such as that done here, where mismatches can be examined to capture the human thought process.
The fruits of the volunteers’ efforts (maps, data, and metadata) are found here: http://www.usaid.gov/results-and-data/progress-data/data/dca
More detail about each phase is posted in the Detailed Job Description.
Throughout the project, interested parties followed the progress of the event by tweeting/following @USAID_Credit or viewing their Facebook page. The hashtag for the event was #USAIDcrowd.
“I enjoyed the crowdsourcing aspect because this is an important new tool. Online project management also is important. The intensity of the project was interesting (over 500 emails in a one month period). It also is important to discover, develop and use tools for keeping volunteers engaged and motivated, especially when the group is so big.”
“It was the most interesting philanthropy project I have worked on since College. Although simple, it resparked a fire which had been missing. I look forward to continuing with GISCorps as often as possible in the future.”
“This was a great example of how globalization benefits Aid efforts. A global community of GIS professionals was able to come together from all over the globe to help benefit humanity.”
“Helping accomplish a worthy goal. Sharing GIS experience for the greater good. Very well coordinated and great opportunity for volunteerism.”