About this tutorial

Since 2009, the International Peace Information Service (IPIS) has managed a database on mining site visits in eastern DR Congo. Since January 2017, IPIS has published the data it has collected over the last decade on its Open Data page. This tutorial explains how to use the IPIS Open Data through various examples.

IPIS welcomes any feedback and questions on this tutorial, as well as on the database, via mail.

Why it is published

IPIS strongly believes in the power of Open Data and wants to encourage the use of the data that it publishes. This tutorial facilitates the analysis and use of the Open Data for researchers, risk managers, policymakers and observers working on mining, mineral trade, and security in eastern DRC.

The data, as well as this tutorial, focus on artisanal mining site visits in eastern DRC. Some background knowledge of the artisanal mining sector, as well as of the geopolitical and security context, is required to use the data correctly when computing statistics. This tutorial provides guidance on the nature and structure of the data, as well as on the context in which it was collected. As such, it aims to enable anyone to explore the IPIS data. The tutorial demonstrates how data processing is done at IPIS, a practice that is under constant improvement and tries to balance efficiency with data-wrangling solutions.

Tools to use

The tutorial computes statistics using the open-source statistical programming language R, available from the R-Project website. It mainly uses functions from the dplyr library for data manipulation. Many good tutorials on dplyr can be found online, such as here. Its main concepts (filtering and selecting, grouping and summarizing, mutating and sorting, …) can, however, be found in any other advanced data manipulation tool, such as standard SQL databases like Postgres, Python’s Pandas library, Tableau or even Excel, OpenOffice or LibreOffice, although the latter three are often much less practical. The R code below will allow users to familiarize themselves with the data structure and understand the data manipulations necessary to compute specific statistics. It can serve as an example for computing similar statistics in R, or as inspiration for making such calculations in other tools.

This tutorial focuses on computing some key statistics. The geographic coordinates of mines play only a minor role. For more advanced GIS analysis, one could also load this data into a GIS tool like QGIS or use the spatial tools or libraries in any of the packages mentioned above. Our Open Data page allows anyone to easily connect a GIS tool to the Open Data using the WFS standard.

Getting the data

IPIS Open Data can be obtained through the IPIS Open Data portal, which also provides notes on how to download the data, a Data Dictionary explaining the different columns used, information on the Open Data license and other contextual information. IPIS regularly adds new data to this dataset, as well as to its online webmaps.

Setting up your environment

For this tutorial, we downloaded the latest Open Data (early January 2018) as a .csv file and set R’s working directory to the folder where we saved the file.

First, let’s load all libraries we will use.

library('dplyr') # For data manipulation
library('tidyr') # For data cleaning
library('lubridate') # For easy date manipulation
library('ggplot2') # For plotting
library('leaflet') # For interactive webmaps
library('scales') # For easy scale functions

And read the data.

data <- read.csv("cod_mines_curated_all_opendata_p_ipis.csv", stringsAsFactors = FALSE, na.strings=c(""))
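Before going further, it can help to check how the columns were read. The sketch below uses a small made-up table rather than the real file (all values are invented), and assumes that dates are stored as ISO-style text ("YYYY-MM-DD"), which is an assumption about the export rather than something documented here. It counts missing values per column and converts the date column explicitly.

```r
# Toy stand-in for the real file; all values are invented for illustration.
toy <- data.frame(
  pcode = c("mine_a", "mine_b", "mine_c"),
  visit_date = c("2015-03-02", "2016-11-20", NA),
  workers_numb = c(40, NA, 12),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Dates read as text can be converted explicitly
# (assumes ISO "YYYY-MM-DD" text; other formats need a format argument)
toy$visit_date <- as.Date(toy$visit_date)
print(class(toy$visit_date))
```

The same two checks can be run on the real data object once it is loaded.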

Getting to know the data

As explained on our Open Data page, this dataset lists visits of artisanal mining sites in eastern DRC. Each visit is characterized by multiple columns, which hold information on six key aspects:

The visit

source
project
visit_date
…

The mine

name, pcode
longitude, latitude
province, …
workers_numb
…

The minerals

is_gold_mine, is_3t_mine
mineral1 (up to 3 minerals)
selling_points_mineral1, final_destination_mineral1
mercury
…

The armed groups

presence, interference
armed_group1 (up to 2 armed groups)
type_armed_group1
frequency_armed_group1
taxation_armed_group1, …
…

The state services

state_service1 (up to 4 state services)

Responsible sourcing initiatives

itsci
qualification
…

The Open Data FAQ and dictionary page gives more information on the meaning and possible values of each of the columns through a ‘Data Dictionary’. This page explains, for example, that the column presence has a value 0 or 1, depending on whether or not at least one armed group is present at the site, whereas the column interference indicates whether one (or both) of the armed groups on the site is either a non-state armed actor or a state armed actor engaging in illegal activities. We’ll use these columns a lot in the examples below. The page also provides some extra contextual information, such as on itsci tagging and on mercury processing. Make sure you read it to understand the columns you will work with!

Now, let’s display the data.

data

Of course, this is a very large table which doesn’t fit on this webpage. You can use the arrows to see more columns, and the pages to see more entries/rows.

Computing explorative statistics

Now, let’s get right to it and compute some first statistics.

> How many visits does this dataset contain?

We can simply count the number of lines.

data %>%
  summarise(count_visits = n())

Since mines are uniquely identified using a pcode, and multiple revisits of the same mine can occur in this dataset, we can compute the number of mines visited (instead of the number of visits) by only keeping the most recent visit per mine.

> How many unique mines were visited?

data %>%
  group_by(pcode) %>% # For each unique mine ...
  arrange(desc(visit_date)) %>% slice(1) %>% # ... only continue with the most recent visit
  ungroup() %>%
  summarise(count_mines = n()) # Then count all lines
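As a cross-check on the query above, the number of unique mines can also be obtained by counting distinct pcode values directly. The sketch below compares both approaches on a small invented table (the pcodes and dates are hypothetical); on real data the two counts should agree as long as every line has a pcode.

```r
library(dplyr)

# Invented example: mine_a was visited twice, mine_b once
toy <- tibble(
  pcode = c("mine_a", "mine_a", "mine_b"),
  visit_date = c("2014-05-01", "2016-08-12", "2015-02-03")
)

# Approach used in this tutorial: keep the most recent visit per mine
latest <- toy %>%
  group_by(pcode) %>%
  arrange(desc(visit_date)) %>% slice(1) %>%
  ungroup()
count_via_slice <- nrow(latest)

# Shortcut: count distinct identifiers
count_via_distinct <- n_distinct(toy$pcode)

print(c(count_via_slice, count_via_distinct))
```

The first approach is more verbose, but it keeps the full line of the latest visit, which is what later queries build on.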

> How many visits were carried out per year and per project?

data %>%
  group_by(source, project, year = year(visit_date)) %>%
  summarise(count_visits = n()) %>%
  arrange(year)

It’s clear that qualification missions took off in 2011, and projects led by IPIS have made up to a thousand visits per year.

> How many times were mines visited?

data %>%
  group_by(pcode) %>% # For each mine ...
  summarise(count_visits = n()) %>% # ... count the number of visits
  group_by(count_visits) %>% # And for each number of visits ...
  summarise(count_mines = n()) %>% # ... count how many mines have this number of visits
  arrange(count_visits)

We can see that a considerable number of mines have been visited more than once. Some were even visited more than five times!

Separating datasets: IPIS visits, qualification missions and iTSCi info

The data exploration above reveals that there are three main categories of lines: visits by IPIS (and its partners), visits by qualification missions and extra iTSCi status info. IPIS visits are numerous and record data for all columns (with some exceptions, discussed further on). Qualification visits are the official record of mining site validations or ‘qualifications’ - an official procedure defined by the National Minister of Mines ¹ to certify mining sites as free of influence from armed groups using colored labels: green, yellow or red. During these visits data is collected in virtually all columns, with some exceptions (e.g. selling_points_mineral1 etc.). The iTSCi status info lines contain additional information on which sites apply iTSCi tagging - a system put in place by ITRI to make the trade of 3T minerals more transparent.

We could choose to run our queries on the entire dataset, but in many cases we will want to run them specifically on the subset of IPIS visits and/or qualification visits. Therefore, it’s a good idea to create two such subsets.

We will also use this occasion to prepare both datasets a bit further and filter out irrelevant lines. Indeed, some visits in this dataset should not be taken into account when computing statistics on the state of artisanal mining in eastern DRC. For example, when IPIS teams visited a mining site and found it to be empty (no workers), it is included in this dataset as a site with no workers, mainly to keep track of the activity at older mines.

# Creating and cleaning IPIS dataset
data_ipis <- data %>%
  filter(grepl('IPIS', project)) %>% # Only visits during IPIS project
  filter(!is.na(workers_numb) & workers_numb > 0) # Has workers
# Creating and cleaning qualification dataset
data_qualification <- data %>%
  filter(project == 'Qualification status') # Only visits during qualification missions

These datasets now allow us to easily query both subsets.

> How many mine visits has IPIS executed by province? How many of these visits were 3T mines and gold mines?

To compute this by province ², we can group by province first.

data_ipis %>%
  group_by(province) %>% # Do the following for each province
  summarise(count_visits = n(), # Count the number of visits = lines
            count_visits_3t = sum(is_3t_mine), # Count the number of visits to 3T mines by summing the 1's this column contains for each 3T mine.
            count_visits_gold = sum(is_gold_mine)) %>% # Similar for gold mines
  arrange(desc(count_visits))

This is the first time we use information about the minerals, in the form of is_3t_mine and is_gold_mine. We’ll go into more detail on how to deal with different types of minerals later on.

The above statistic discusses mining site visits. Now, instead of mine visits, let’s count the actual uniquely identifiable mines which have been visited (once or multiple times) by province.

> How many mining sites were visited by province?

data_ipis %>%
  group_by(pcode) %>%
  arrange(desc(visit_date)) %>% slice(1) %>% # Only continue with the most recent visit for each mine
  ungroup() %>%
  group_by(province) %>% # Same as above
  summarise(count_mines = n(),
            count_mines_3t = sum(is_3t_mine),
            count_mines_gold = sum(is_gold_mine)) %>%
  arrange(desc(count_mines))

We can also compute statistics on the qualification dataset.

> How long ago were mines qualified?

data_qualification %>%
  mutate(qualification_year = year(visit_date)) %>%
  group_by(qualification_year, qualification) %>%
  summarise(count_mines = n())

Working with the most recent data

For many queries, only the most recent data collected at each mining site is needed. To prevent having to filter for the most recent visit for each mine every time (in the same way as we have done above), it can be practical to make a dataset which only contains these most recent visits.

Let’s first make sure we know what’s happening. Are we sure that the last visit contains the most recent data for each column? We want to ensure that there are no cases where data was collected on a specific topic (i.e. in a specific column) in an earlier visit but not in the last visit. In that case we would lose the earlier information by filtering for the most recent visit. So, did all visits, or more practically all projects, collect data in all columns?

In general, we can state that during IPIS projects data has been recorded in all key columns from the first project and visits onward, and that some more detailed columns were added later. The most notable additions are interference from 2011 onward, selling_points_mineral1 etc., taxation_armed_group1 etc., frequency_armed_group1, state_service1 etc. and itsci from 2013 onward and mercury from 2015 onward. Once a column was collected in a project, it was generally also collected in later projects ³. The data uses NA or NULL values to indicate ‘no information was collected’, and these can be used to check which columns were collected in which projects and years. Given these notes, we can conclude that for IPIS visits the information collected at the last visit can be considered complete and the most recent.
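The check described above, using NA values to see which columns were collected in which project, can be sketched as follows. The table is invented (project names and values are hypothetical); on the real data, the same summarise() call could be applied to data grouped by project and year.

```r
library(dplyr)

# Invented example: the older project recorded neither column
toy <- tibble(
  project = c("Project 2010", "Project 2010", "Project 2015", "Project 2015"),
  interference = c(NA, NA, 0, 1),
  mercury = c(NA, NA, 1, NA)
)

# Share of visits per project in which each column holds a value
coverage <- toy %>%
  group_by(project) %>%
  summarise(count_visits = n(),
            share_interference_recorded = mean(!is.na(interference)),
            share_mercury_recorded = mean(!is.na(mercury)))

print(coverage)
```

A share of 0 suggests the column was not collected in that project, while a share between 0 and 1 points to occasional gaps within a project.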

Let us now create a dataset with only the last visit of each mine.

# For IPIS visits
data_ipis_lastvisit <-
  data_ipis %>%
  group_by(pcode) %>%
  arrange(desc(visit_date)) %>% slice(1) %>% # Only continue with the most recent visit for each mine
  ungroup()
# For qualification missions
data_qualification_lastvisit <-
  data_qualification %>%
  group_by(pcode) %>%
  arrange(desc(visit_date)) %>% slice(1) %>% # Only continue with the most recent visit for each mine
  ungroup()

This enables us to quickly answer some interesting questions.

> What is the average number of workers for gold mines in the Mambasa territory, according to IPIS visits?

data_ipis_lastvisit %>%
  filter(territoire == "Mambasa", is_gold_mine == 1) %>%
  summarise(sum = sum(workers_numb),
            avg = mean(workers_numb), # Average
            median = median(workers_numb),
            min = min(workers_numb),
            max = max(workers_numb),
            nonna = sum(!is.na(workers_numb))) # Number of lines (with data on number of workers)

It’s a good idea to always compute a total count too when looking for a percentage. This way one can spot occasions where the total count is too low to make any strong statistical statements. Also, since for most columns the data is highly non-Gaussian, it’s a good idea to not only rely on means/averages, but to also include other statistics to describe the distribution. The median, for example, is much less influenced by extreme values.
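To illustrate why the median is more robust than the mean, here is a minimal sketch with invented worker counts: four small sites and one very large one.

```r
# Invented worker counts: four small sites and one very large site
workers <- c(10, 12, 15, 20, 1500)

avg_workers <- mean(workers)       # pulled up strongly by the large site
median_workers <- median(workers)  # stays close to the typical site
count_sites <- length(workers)     # total count, to judge how solid the numbers are

print(c(avg = avg_workers, median = median_workers, n = count_sites))
```

Here the mean is more than twenty times the median, even though four out of five sites have twenty workers or fewer.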

> Which territory in the Kivus has the highest percentage of mines with interference according to IPIS visits?

data_ipis_lastvisit %>%
  filter(province %in% c("Nord-Kivu", "Sud-Kivu")) %>%
  group_by(territoire) %>%
  summarise(share_mines_interference = mean(interference, na.rm = TRUE)) %>% # The "na.rm = TRUE" code means that we discard the (small amount of) mines with no information ('NA') on the interference to compute this share.
  arrange(desc(share_mines_interference))

Note that, by using na.rm = TRUE, mines with interference = NA are left out of the calculation altogether: they count neither as mines with interference nor as mines without. (This is the case for a small number of mine visits in 2009 and 2010, when for some mines IPIS was not yet able to determine whether there was interference at the site.)
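The effect of na.rm = TRUE on a share can be seen in a small sketch with invented values. It compares dropping the unknown cases with the alternative of treating them as ‘no interference’, which is shown only for contrast and is not what the query above does.

```r
# Invented 0/1 values with one unknown
interference <- c(1, 0, NA, 1)

# Unknowns dropped from numerator and denominator: 2 out of 3 known mines
share_dropping_na <- mean(interference, na.rm = TRUE)

# Unknowns treated as 0 (for contrast only): 2 out of 4 mines
share_na_as_zero <- mean(ifelse(is.na(interference), 0, interference))

print(c(share_dropping_na, share_na_as_zero))
```

Which option is appropriate depends on the question; when many values are missing, it is worth reporting the number of unknowns alongside the share.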

Comparing revisits

Of course, we can still access all visits in the data dataset. Let’s ask some questions that compare multiple visits of the same mines.

> How does interference evolve over multiple visits to the same mine?

We’ll have to ask this question for a specific mine, so first let’s find a mine with multiple revisits.

data %>% 
  group_by(pcode) %>%
  summarise(count_visits = n()) %>%
  top_n(5, count_visits)

Now, let’s inspect the visits to the mine with pcode ‘codmine00906’.

data %>%
  filter(pcode == 'codmine00906') %>%
  select(name, project, visit_date, workers_numb, presence, interference, armed_group1, armed_group2)

It appears that FARDC elements have been recorded on and off, and that they had left the site by the time of the most recent visits.

> How does the number of workers change between the wet and dry season?

During our 2017 project around Mambasa, we visited mines both in the wet and dry season as part of a follow-up. Let’s compare the number of workers between both seasons.

data %>%
  filter(project == "IPIS - PPA Mambasa 2017") %>%
  transmute(pcode, visit_date, period = ifelse(visit_date < "2017-04-01", "dry", "wet"), workers_numb) %>%
  group_by(period) %>%
  summarise(sum = sum(workers_numb),
            avg = mean(workers_numb),
            median = median(workers_numb))

The data shows that mines are typically less active during the wet period.
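Since the same mines were visited in both seasons, one could also compare the two visits mine by mine rather than comparing overall totals. The following is a minimal sketch on invented numbers, assuming exactly one visit per mine and period; pivot_wider() comes from tidyr (version 1.0.0 or later).

```r
library(dplyr)
library(tidyr)

# Invented example: two mines, each visited once per season
toy <- tibble(
  pcode = c("mine_a", "mine_a", "mine_b", "mine_b"),
  period = c("dry", "wet", "dry", "wet"),
  workers_numb = c(50, 30, 20, 25)
)

# One line per mine, with a column per season and the change between them
paired <- toy %>%
  pivot_wider(names_from = period, values_from = workers_numb) %>%
  mutate(change = wet - dry)

print(paired)
```

A per-mine comparison shows whether a drop in the totals is shared by most sites or driven by a few large ones.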

Weighting by number of workers

With artisanal mines varying from a handful of workers to more than a thousand workers, it can be useful to weight a statistic by the number of workers at each site, to account for how many miners are affected.

> How many workers work at mines with interference by an armed group according to IPIS visits?

data_ipis_lastvisit %>%
  group_by(interference) %>%
  summarise(count_mines = n(), 
            count_workers = sum(workers_numb))

> Which provinces have the highest number of mines with interference, or workers working at mines with interference, according to IPIS visits?

data_ipis_lastvisit %>%
  group_by(province) %>%
  summarise(count_mines = n(), 
            count_mines_interference = sum(interference, na.rm = TRUE),
            count_workers_interference = sum(interference*workers_numb, na.rm = TRUE),
            share_mines_interference = mean(interference, na.rm = TRUE),
            share_workers_interference = sum(interference*workers_numb, na.rm = TRUE)/sum(workers_numb)
              ) %>%
  arrange(desc(share_workers_interference))
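The worker-weighted share can also be written with base R’s weighted.mean(). The sketch below, on invented values, shows one subtle difference: with na.rm = TRUE, weighted.mean() drops mines with unknown interference from both the numerator and the denominator, whereas the query above divides by the workers of all mines.

```r
# Invented example: the last mine has unknown interference
interference <- c(1, 0, 0, NA)
workers_numb <- c(200, 50, 50, 100)

# Unweighted share of mines with interference (unknowns dropped): 1 of 3
share_mines <- mean(interference, na.rm = TRUE)

# Worker-weighted share, unknowns dropped everywhere: 200 / 300
share_workers_known <- weighted.mean(interference, workers_numb, na.rm = TRUE)

# Same numerator, but divided by all workers as in the query above: 200 / 400
share_workers_all <- sum(interference * workers_numb, na.rm = TRUE) / sum(workers_numb)

print(c(share_mines, share_workers_known, share_workers_all))
```

When few values are missing the two weighted shares are close; when many are missing, it is worth stating which denominator was used.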

Notes on context and nuance

An important note to be made here, and to be remembered in general, is that the IPIS data is not exhaustive (not all mines have been visited, and visits are not frequent) and is statistically biased (some projects have visited a specific subset of mines). Since in 2013-2014 and 2015 IPIS visited a large number of artisanal mining sites in eastern DRC, trying to cover as many sites as possible in all relevant territories, the visits in those years can be considered a fairly unbiased statistical sample. In 2016 and 2017, however, IPIS’s visits were more focused. The most important remarks in this context revolve around the following projects:

The ‘IOM 2016-2018’ project focused its mine visits in regions where mines had earlier been qualified by an official qualification mission, or where the ‘joint qualification mission’ planned future missions.
The ‘IPIS - PPA Mambasa 2017’ project had a very specific geographic and mineral focus, visiting only gold sites around Mambasa, selected because of its high density of mining sites and relatively low interference (at that point in time). Additionally, each site was visited twice as part of a follow-up study, resulting in a high number of visits with similar characteristics.

This explains why there are many more qualified mines and remarkably fewer mines with interference in the subset of 2017 visits, when compared to earlier years with a high number of visits.

data_ipis_lastvisit %>%
  mutate(year = year(visit_date)) %>%
  group_by(year) %>%
  summarise(count_mines = n(), 
            count_mines_interference = sum(interference, na.rm = TRUE),
            share_mines_interference = mean(interference, na.rm = TRUE))

Due to the bias in the sample of mines visited, one cannot immediately conclude from these numbers that, for example, 2017 saw a global trend towards more qualifications or less interference from armed groups, or that gold sites have become more prevalent.

For more contextual information, we invite users to read the respective IPIS reports, available on our website. We also urge users to interpret statistical results with care and get in contact with IPIS researchers if they have questions on specific results.

Dealing with multiple minerals and armed groups

At a given mining site, multiple minerals can be mined, and multiple armed groups and multiple state services can be present. The dataset accounts for this by repeating the columns on mineral-related subjects three times (mineral1, mineral2, mineral3 etc.), those on armed groups two times (armed_group1, armed_group2 etc.) and those on state services four times (state_service1, state_service2, state_service3, state_service4 etc.), with the first mineral, armed group or state service always being the main one. Additionally, the dataset includes columns such as is_3t_mine, is_gold_mine, presence and interference, which combine the information on multiple minerals or armed groups into one column for easier queries. If we need data like the specific mineral names or armed group names, however, we will need to make queries that take all columns on minerals and/or armed groups into account.

There are various ways to do this, and this is the method we’ve found most useful: first make a new dataset which brings all mineral information into one column, and subsequently use that dataset for queries. For minerals, this can be done as follows.

data_ipis_mineral <- 
  bind_rows(
    data_ipis %>% 
      transmute(
        pcode, # Select the column you'll want to use
        visit_date,
        province,
        territoire,
        workers_numb,
        is_3t_mine,
        is_gold_mine,
        mineral = mineral1,
        selling_points_mineral = selling_points_mineral1,
        final_destination_mineral = final_destination_mineral1,
        presence,
        interference,
        armed_group1,
        armed_group2),
    data_ipis %>% 
      filter(!is.na(mineral2)) %>%
      transmute(
        pcode,
        visit_date,
        province,
        territoire,
        workers_numb,
        is_3t_mine,
        is_gold_mine,
        mineral = mineral2,
        selling_points_mineral = selling_points_mineral2,
        final_destination_mineral = final_destination_mineral2,
        presence,
        interference,
        armed_group1,
        armed_group2),
    data_ipis %>% 
      filter(!is.na(mineral3)) %>%
      transmute(
        pcode,
        visit_date,
        province,
        territoire,
        workers_numb,
        is_3t_mine,
        is_gold_mine,
        mineral = mineral3,
        selling_points_mineral = selling_points_mineral3,
        final_destination_mineral = final_destination_mineral3,
        presence,
        interference,
        armed_group1,
        armed_group2))

The resulting dataset contains a line for each mineral of each (IPIS) visit. There are thus multiple lines per visit (one for each mineral). As long as we group or filter our queries by/for mineral, the result will again contain one line per visit, and we can formulate our queries as before.
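As an alternative to stacking three copies with bind_rows(), the same one-line-per-mineral layout can be sketched with tidyr’s pivot_longer(). This is not the method used in this tutorial; it is a more compact variant shown on an invented table (the site names, selling points and values are hypothetical), and it assumes that every column ending in mineral1, mineral2 or mineral3 belongs to the repeated mineral block.

```r
library(dplyr)
library(tidyr)

# Invented example with two visits; mine_a reports two minerals
visits <- tibble(
  pcode = c("mine_a", "mine_b"),
  workers_numb = c(40, 15),
  mineral1 = c("Or", "Or"),
  mineral2 = c("Coltan", NA_character_),
  mineral3 = c(NA_character_, NA_character_),
  selling_points_mineral1 = c("Town X", "Town Y"),
  selling_points_mineral2 = c("Town Z", NA_character_),
  selling_points_mineral3 = c(NA_character_, NA_character_)
)

long <- visits %>%
  pivot_longer(
    cols = matches("mineral[1-3]$"),    # all numbered mineral columns
    names_to = c(".value", "rank"),     # split "mineral1" into "mineral" + "1"
    names_pattern = "^(.*)([1-3])$"
  ) %>%
  filter(rank == "1" | !is.na(mineral)) # keep the main mineral, drop empty extras

print(long)
```

The rank column records whether a line came from the first, second or third mineral slot, which can be useful when only the main mineral is of interest.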

> How many artisanal miners are involved in mining each of the different minerals?

data_ipis_mineral %>%
  group_by(pcode) %>%
  filter(visit_date == max(visit_date)) %>% # Only continue with the lines (one per mineral) of the most recent visit for each mine
  ungroup() %>%
  group_by(mineral) %>% # Group by the unified mineral column.
  summarise(count_mines = n(),
            count_workers = sum(workers_numb)) %>%
  arrange(desc(count_workers))

Notice that a worker working at a site where more than one mineral is mined is counted for each of those minerals! The sum of count_workers will thus be more than the total number of workers in data_ipis_lastvisit.

Let’s create a similar dataset for armed groups.

data_ipis_armed_group <- 
  bind_rows(
    data_ipis %>% 
      transmute(
        pcode,
        visit_date,
        province,
        territoire,
        workers_numb,
        is_3t_mine,
        is_gold_mine,
        mineral1,
        mineral2,
        mineral3,
        presence,
        interference,
        armed_group = armed_group1,
        taxation_armed_group = taxation_armed_group1,
        commerce_taxation_armed_group = commerce_taxation_armed_group1,
        entrance_taxation_armed_group = entrance_taxation_armed_group1,
        monopoly_armed_group = monopoly_armed_group1,
        buying_minerals_armed_group = buying_minerals_armed_group1,
        digging_armed_group = digging_armed_group1,
        forced_labour_armed_group = forced_labour_armed_group1,
        pillaging_armed_group = pillaging_armed_group1),
    data_ipis %>% 
      filter(!is.na(armed_group2)) %>%
      transmute(
        pcode,
        visit_date,
        province,
        territoire,
        workers_numb,
        is_3t_mine,
        is_gold_mine,
        mineral1,
        mineral2,
        mineral3,
        presence,
        interference,
        armed_group = armed_group2,
        taxation_armed_group = taxation_armed_group2,
        commerce_taxation_armed_group = commerce_taxation_armed_group2,
        entrance_taxation_armed_group = entrance_taxation_armed_group2,
        monopoly_armed_group = monopoly_armed_group2,
        buying_minerals_armed_group = buying_minerals_armed_group2,
        digging_armed_group = digging_armed_group2,
        forced_labour_armed_group = forced_labour_armed_group2,
        pillaging_armed_group = pillaging_armed_group2))

The following query puts it to use.

> Which are the main armed groups, measured by how many workers work at mines where they operate? At which percentage of their sites do they engage in illegal taxation?

data_ipis_armed_group %>%
  group_by(pcode) %>%
  filter(visit_date == max(visit_date)) %>% # Only continue with the lines (one per armed group) of the most recent visit for each mine
  ungroup() %>%
  group_by(armed_group) %>% # Group by the unified armed group column.
  summarise(count_mines = n(),
            count_workers = sum(workers_numb), 
            share_mines_taxation = mean(taxation_armed_group, na.rm = TRUE)) %>%
  arrange(desc(count_workers))

Again, workers working at mines where more than one armed group is present are counted for each group here. Miners working at mines where no armed group is present are listed only once, in the entry with no armed group.

For state services, a similar dataset can be made. Additionally, to run queries that relate to both mineral and armed_group, one would first need to create a data_ipis_armed_group_mineral dataset by applying to data_ipis_armed_group the same steps that were used to create data_ipis_mineral from data_ipis.

Working with destinations

Each of the minerals has columns specifying the selling points (trade centers or ‘points de vente’) where the minerals are first sold, and the final destination of the minerals, where they connect to international trade flows. As mentioned earlier, data was collected on this topic from 2013 onward. Since we included these columns in the construction of the data_ipis_mineral dataset above, we can use that dataset, combined with a function to search for names in the plain text fields of these columns (we’ll use grepl() for this), to investigate selling points and final destinations.

> How many cassiterite mines have ‘Itebero’ as their selling point?

data_ipis_mineral %>%
  group_by(pcode) %>%
  filter(visit_date == max(visit_date)) %>% # Only continue with the lines (one per mineral) of the most recent visit for each mine
  ungroup() %>%
  filter(mineral == "Cassitérite") %>%
  filter(grepl('Itebero', selling_points_mineral)) %>%
  summarise(count_mines = n())
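Because selling points are free-text fields, the way grepl() matches matters. The sketch below, on invented field values, shows that matching is case-sensitive by default, that ignore.case = TRUE relaxes this, and that missing values simply yield FALSE.

```r
# Invented free-text values; only 'Itebero' is taken from the example above
selling_points <- c("Itebero", "itebero, Town X", "Town X", NA)

# Default: case-sensitive substring match; NA gives FALSE
match_exact_case <- grepl("Itebero", selling_points)

# Ignoring case also catches differently capitalised spellings
match_any_case <- grepl("Itebero", selling_points, ignore.case = TRUE)

print(match_exact_case)
print(match_any_case)
```

It can be worth listing the distinct values of a text column first, to spot spelling variants that a single pattern would miss.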

> Which armed group(s), if any, control the mines in Mambasa exporting to Bafwabangu (or other places)?

data_ipis_mineral %>%
  group_by(pcode) %>%
  arrange(desc(visit_date)) %>% slice(1) %>% # Only continue with the most recent visit for each mine
  ungroup() %>%
  filter(territoire == "Mambasa") %>%
  filter(grepl('Bafwabangu', selling_points_mineral)) %>%
  select(pcode, armed_group1, armed_group2)

> Which are the major final destinations for gold, for the mines visited by IPIS? How many mines with/without interference have them as final destination?

data_ipis_mineral %>%
  group_by(pcode) %>%
  filter(visit_date == max(visit_date)) %>% # Only continue with the lines (one per mineral) of the most recent visit for each mine
  ungroup() %>%
  filter(mineral == "Or") %>%
  mutate(final_destination_mineral = strsplit(as.character(final_destination_mineral), ", ")) %>% # By splitting the column's content on the comma and 'unnesting', we create an intermediate dataset with one line per final destination.
  unnest(final_destination_mineral) %>%
  group_by(final_destination_mineral) %>%
  summarise(count_mines = n(),
            count_mines_interference = sum(interference, na.rm = TRUE),
            share_mines_interference = mean(interference, na.rm = TRUE)) %>% 
  top_n(5, count_mines)
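The split-and-unnest step used above can also be written in one call with tidyr’s separate_rows(). This is a sketch on an invented table (town names and values are hypothetical), assuming destinations are separated by a comma followed by a space.

```r
library(dplyr)
library(tidyr)

# Invented example: mine_a lists two final destinations
toy <- tibble(
  pcode = c("mine_a", "mine_b"),
  final_destination_mineral = c("Town X, Town Y", "Town X"),
  interference = c(1, 0)
)

per_destination <- toy %>%
  separate_rows(final_destination_mineral, sep = ", ") %>% # one line per destination
  group_by(final_destination_mineral) %>%
  summarise(count_mines = n(),
            share_mines_interference = mean(interference, na.rm = TRUE))

print(per_destination)
```

As with minerals, a mine that lists several destinations is counted once for each of them.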

Crossing IPIS visits with qualification missions

At this stage, the data_ipis and data_ipis_lastvisit datasets contain information on the qualification in the qualification column, but only if that information was known to the interviewee of the IPIS visit. They do not, however, contain information on the official qualification if the mine was also visited by a qualification mission at another time. To compute statistics involving both, we can cross information from the last IPIS visits with information from the (last) official qualification mission. This can be done using a join function on the pcode column, which uniquely identifies each mine in our dataset and which was attributed by IPIS. This way the IPIS dataset can be extended with columns from the qualification visits.

> Did IPIS in its latest data record interference at a site which had been qualified as ‘green’ earlier?

data_ipis_lastvisit %>%
  left_join(
    data_qualification_lastvisit %>% # Join in qualification data ...
      select(pcode, qualification_official = qualification, qualification_official_date = visit_date), by = "pcode") %>% # ... taking only the key qualification columns and renaming them.
  filter(qualification_official_date <= visit_date) %>% # Latest IPIS visit after latest Qualification visit.
  filter(qualification_official == "Vert" & interference == 1) %>% # Qualification visit was green and IPIS visit saw interference
  select(pcode, name, visit_date, interference, armed_group1, armed_group2, qualification_official_date, qualification_official)

The entire dataset produced by this join shows that almost all green qualifications are at sites without interference, as one would expect. The query above lists the few exceptions, which could be used as the starting point for further investigation.
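It helps to keep in mind what the left join and the date filter do to mines without a qualification visit. The sketch below uses two invented tables (pcodes and dates are hypothetical; ‘Vert’ is the label used in the query above) and compares ISO-style date text directly, as done elsewhere in this tutorial.

```r
library(dplyr)

# Invented IPIS last visits
ipis <- tibble(
  pcode = c("mine_a", "mine_b", "mine_c"),
  visit_date = c("2016-05-01", "2016-06-01", "2016-07-01"),
  interference = c(1, 0, 1)
)

# Invented qualification visits; mine_c was never qualified
qualif <- tibble(
  pcode = c("mine_a", "mine_b"),
  qualification_official = c("Vert", "Vert"),
  qualification_official_date = c("2014-01-10", "2017-02-01")
)

# The left join keeps all IPIS mines; unmatched ones get NA in the new columns
joined <- ipis %>% left_join(qualif, by = "pcode")

# The date filter drops mines qualified after the IPIS visit (mine_b)
# and mines with no qualification at all (mine_c, because NA <= date is NA)
kept <- joined %>% filter(qualification_official_date <= visit_date)

print(joined)
print(kept)
```

Counting the lines before and after the filter is a quick way to see how many mines the final result actually covers.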

Crossing IPIS visits with iTSCi information

Just like with qualifications, the information on whether or not iTSCi tagging is used at the site comes, or may come in this case, from a different source than the IPIS visits: the additional iTSCi status lines. Hence, to ask questions related to both iTSCi tagging and IPIS visits, we’ll again perform a join. Since we don’t have an official source for iTSCi related data, we’ll take any info we have about iTSCi in this join.

> How is iTSCi tagging correlated to interference at mining sites?

data_ipis_lastvisit %>%
  left_join(
    data %>% # Join in any data (independent of the project or source) ...
      filter(itsci == 'Actif') %>% # ... that has iTSCi info ...
      group_by(pcode) %>% arrange(desc(visit_date)) %>% slice(1) %>% ungroup %>% # ... (leave only one line per pcode if there would be multiple) ...
      select(pcode, itsci_all = itsci), by = "pcode") %>% # ... taking only the itsci column and renaming it.
  group_by(itsci_all) %>%
  summarise(count_mines = n(),
            count_workers = sum(workers_numb), 
            share_mines_interference = mean(interference, na.rm = TRUE)) %>%
  arrange(desc(count_workers))

This confirms that iTSCi tagging is almost always done at mines without interference, because that’s a prerequisite for the iTSCi scheme and, conversely, because the presence of the iTSCi scheme might discourage interference.

Exploring data on a map

For proper GIS analysis, one can use dedicated tools as mentioned above. Here, we will quickly create a simple webmap that allows us to explore the spatial distribution of one particular feature: interference.

data_ipis_lastvisit %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    radius = ~rescale(sqrt(workers_numb), c(1,15)),
    color = ~ifelse(interference == 1, "red", "green"),
    stroke = FALSE, 
    fillOpacity = 0.5
  )

Wrapping up

We’ve explored how to use the IPIS Open Data. Now it’s your turn! Download the data, compute some statistics, maybe cross the data with other data sources, and share your results with the world! We’d love to see how you use our data, and are happy to receive your feedback to continue to improve how we share Open Data. Contact us! Also, if you find this tutorial helpful, you may consider sharing the code from your own research online in a similar way (The page you are reading now is a so-called ‘RMarkdown’ notebook - a popular way to share R code). This way other researchers can reproduce your results and learn from your approach and insights.

¹ Note that qualification missions are included in this Open Data dataset thanks to a MoU signed between IPIS and BGR, the latter remaining the official source of the data to be attributed in each use case. IPIS assigned its own pcodes identifying individual mines to this data, to allow crossing with data on IPIS visits. Even though this was done with a lot of care by the IPIS researchers, some imperfections may have occurred when matching these sites. As an example: IPIS visits might have considered multiple pits/mines as multiple sites, whereas qualification visits might later have considered them all to be part of the same site. In this case, in general the pcode of the biggest pit is assigned to the qualification visit, which is as such treated as qualified in our calculations, but the other pits will not have a qualification visit linked to them, and are (wrongly) treated as unqualified. The inverse situation could also occur. In general, it is advised to treat statistics that use qualification or itsci information as an indication, and to always check the findings in the tables or plot the mines on a map to fully understand the situation.
² As the IPIS Open Data dictionary page notes, there is a column for the pre-2015 DRC provincial subdivision called province_old and a column for the post-2015 provinces called province.
³ With the exception of some specific types of interference (like forced_labour_armed_group1, collected since 2013 except during the ‘IPIS - PPA Mambasa 2017’ project), and the qualification column, which was only collected (during Qualification visits and) during the ‘IPIS - PROMINES MoFA 2013-2014’ project. For this column, however, we advise considering only the official information coming from the Qualification visits; see the chapter on ‘Crossing IPIS visits with qualification missions’.

IPIS Open Data Tutorial

March 19th, 2018