Damage caused by Gorkha earthquake. Image courtesy of BBC

by George Taniwaki

At the beginning of each quarter, edX and Microsoft offer a month-long course called DAT 102x, Data Science Capstone. The class consists of a single machine learning project using real-world data. The class this past April used data collected by the Nepal government after the Gorkha earthquake in 2015. The earthquake killed nearly 9,000 people and left over 100,000 homeless.

The assignment was to predict the damage level for individual buildings based on building characteristics such as age, height, location, construction materials, and use type. Here are the steps I used to solve this, or any other, machine learning problem.

**About machine learning contests**

The class project is operated as a contest. Students download training and test datasets, create a model using the training dataset, use the model to make predictions for the test dataset, submit their predictions, and see their scored results on a leaderboard.

As is common for machine learning contests, the training data consists of two files. The first file includes a large number of records (for the earthquake project, there were 10,000). Each record consists of an index and the independent variables (38 building parameters). A separate label file contains only two columns, the index and the associated dependent variable(s) (in the earthquake project, there is only one, called damage_grade). The two files must be joined before creating a model.
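In pandas, the join is a one-line merge on the shared index. A minimal sketch with made-up values (the column names besides damage_grade, and the building IDs, are hypothetical stand-ins for the real files):

```python
import pandas as pd

# Toy stand-in for the features file: an index plus building parameters
features = pd.DataFrame({
    "building_id": [101, 102, 103],
    "age": [10, 25, 995],
    "height": [5, 6, 4],
})

# Toy stand-in for the label file: the index plus damage_grade
labels = pd.DataFrame({
    "building_id": [101, 102, 103],
    "damage_grade": [1, 3, 2],
})

# Join the two files on the shared index before building a model
train = features.merge(labels, on="building_id", how="inner")
```

An inner join keeps only records present in both files, which guards against stray unlabeled rows.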

The test file (10,000 more records of building indexes and parameters) has the same format as the training file. You use your model to create an output file consisting of the index and the predicted values of the dependent variable(s). You submit the file to a web service which then scores your results.

You can submit multiple predictions to be scored, adjusting your model after each submission in an attempt to improve your score. Your final score is based on the highest score achieved. To reduce the chance that competitors (students) overfit their model to the test data, the score is based on an undisclosed subset of records in the test file.

**Approach to model building**

The general approach to building a machine learning model is to first examine the dependent variables using univariate methods (step 1). Repeat for the independent variables (step 2). Normalize the variables (step 3). Examine correlations using multivariate methods (step 4). Select the relevant variables, choose a model, and build it (step 5). Evaluate and test the model (step 6) and tune the parameters (step 7) to get the best fit without overfitting. Some iteration may be required.

**Step 1: Univariate statistics for the dependent variable**

The dependent variable, damage_grade, is labeled with an integer from 1 to 3. Higher values mean worse damage. However, the intervals between each class are not constant, so the scale is ordinal, not interval. Descriptive statistics are shown in Table 1 and Figure 1 below.

**Table 1**. Descriptive statistics for dependent variable

| Variable name | Min | Median | Max | Mean | Std dev |
|---|---|---|---|---|---|
| damage_grade | 1 | 2 | 3 | 2.25 | 0.61 |

**Figure 1.** Histogram of dependent variable
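These statistics and the histogram counts can be produced directly in pandas. A minimal sketch with a toy damage_grade column (the values here are illustrative, not the real 10,000-row dataset):

```python
import pandas as pd

# Toy damage_grade column; the real label file has 10,000 rows
damage = pd.Series([1, 2, 2, 3, 3, 3], name="damage_grade")

# Descriptive statistics matching the columns of Table 1
stats = {
    "min": damage.min(),
    "median": damage.median(),
    "max": damage.max(),
    "mean": round(damage.mean(), 2),
    "std": round(damage.std(), 2),
}

# Count per class; these counts are what the histogram plots
counts = damage.value_counts().sort_index()
```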

**Step 2: Univariate statistics for the independent variables**

As mentioned above, there are 38 building parameters. Details of the variables are given at Data Science Capstone. The 38 independent variables can be divided into four classes: interval integer, float, categorical, and binary, as shown in Table 2 below. Notice that the parameter count_families is defined as a float even though it only takes on integer values.

**Table 2.** Summary of independent variables

| Variable type | Quantity | Examples |
|---|---|---|
| Binary (Boolean) | 22 | has_superstructure_XXXX, has_secondary_use, has_secondary_use_XXXX |
| Integer, interval | 4 | count_floors_pre_eq, height, age, area |
| Float | 1 | count_families |
| Categorical | 11 | geo_level_1_id, geo_level_2_id, geo_level_3_id, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, legal_ownership_status |

The binary variables fall into three groups. The first is has_superstructure_XXXX, where XXXX is one of 11 possible materials used to produce the building superstructure, such as adobe mud, timber, bamboo, etc. The second is has_secondary_use_XXXX, where XXXX is one of 10 possible secondary uses for the building, such as agriculture, hotel, school, etc. Finally, has_secondary_use indicates whether any of the has_secondary_use_XXXX variables is true.

Whenever I have groups of binary variables, I like to create new interval integer variables based on them. In this case, they are named count_superstructure and count_secondary_use, each a count of the number of true values in its group. count_superstructure can vary from 0 to 11, while count_secondary_use can vary from 0 to 10.
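With pandas, the counts are a row-wise sum over each group of flag columns. A minimal sketch using only a few of the flags (the real data has 11 superstructure and 10 secondary-use flags):

```python
import pandas as pd

# Toy frame with a subset of the binary flags
df = pd.DataFrame({
    "has_superstructure_adobe_mud":  [1, 0, 1],
    "has_superstructure_timber":     [0, 1, 1],
    "has_secondary_use_agriculture": [0, 0, 1],
    "has_secondary_use_hotel":       [0, 1, 1],
})

# The trailing underscore in the prefix excludes the aggregate
# has_secondary_use flag, which must not be counted
super_cols = [c for c in df.columns if c.startswith("has_superstructure_")]
use_cols = [c for c in df.columns if c.startswith("has_secondary_use_")]

df["count_superstructure"] = df[super_cols].sum(axis=1)
df["count_secondary_use"] = df[use_cols].sum(axis=1)
```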

For the 7 numerical parameters, their minimum, median, maximum, mean, standard deviation, and histogram are shown in Table 3 and Figure 2 below. Possible outliers, which occur in all 7 numerical variables, are highlighted in red.

**Table 3.** Descriptive statistics for the 7 numerical variables. Red indicates possible outliers

| Variable name | Min | Median | Max | Mean | Std dev |
|---|---|---|---|---|---|
| count_floors_pre_eq | 1 | 2 | 9 | 2.15 | 0.74 |
| height | 1 | 5 | 30 | 4.65 | 1.79 |
| age | 0 | 15 | 995 | 25.39 | 64.48 |
| area | 6 | 34 | 425 | 38.44 | 21.27 |
| count_families | 0 | 1 | 7 | 0.98 | 0.42 |
| count_superstructure | 1 | 1 | 8 | 1.45 | 0.78 |
| count_secondary_use | 0 | 0 | 2 | 0.11 | 0.32 |

**Figure 2.** Histograms for the 7 numerical variables. Red indicates possible outliers

Upon inspection of the dataset, none of the numerical variables have missing values. However, in 40 records age has an outlier value of 995; the next highest value is 200. Further, the lowest age value is zero, which does not work well with the log transform. To clean the data, we will create a new binary variable named has_missing_age, with value 1 if age = 995, else 0. We will also create a new numerical variable named adjust_age, with value 1 if age = 0; 15 (the median) if age = 995; otherwise equal to age.
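The two new variables can be built with vectorized conditionals. A minimal sketch with illustrative age values:

```python
import numpy as np
import pandas as pd

# Toy age column including the 995 sentinel and a zero
age = pd.Series([0, 15, 30, 995, 200])

# Flag records where age carries the 995 outlier/sentinel value
has_missing_age = (age == 995).astype(int)

# Map 995 to the median (15) and 0 to 1 so the later
# log transform is defined everywhere
adjust_age = np.where(age == 995, 15, np.where(age == 0, 1, age))
```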

The variable area has a wide range, but does not appear to contain any outliers.

The variables count_families, count_superstructure, and count_secondary_use do not seem to have any outliers. They also do not need to be normalized.

For the 11 categorical variables, the number of categories (enums), and the names of the enums with the largest and smallest counts are shown in Table 4 below.

**Table 4**. Descriptive statistics for the 11 categorical variables. Red indicates one or more enums has fewer than 10 recorded instances

| Variable name | Count enums | Max label / count | Min label / count | Comments |
|---|---|---|---|---|
| geo_level_1_id | 31 | 0 / 903 | 30 / 7 | hierarchical |
| geo_level_2_id | 1137 | 0 / 157 | tie / 1 | hierarchical |
| geo_level_3_id | 5172 | 7 / 24 | tie / 1 | hierarchical |
| land_surface_condition | 3 | d502 / 8311 | 2f15 / 347 | |
| foundation_type | 5 | 337f / 8489 | bb5f / 48 | |
| roof_type | 3 | 7g76 / 7007 | 67f9 / 579 | |
| ground_floor_type | 5 | b1b4 / 8118 | bb5f / 10 | |
| other_floor_type | 4 | f962 / 6412 | 67f9 / 415 | correlated with ground_floor_type |
| position | 4 | 3356 / 7792 | bcab / 477 | |
| plan_configuration | 9 | a779 / 9603 | cb88 / 1 | lumpy distribution |
| legal_ownership_status | 4 | c8e1 / 9627 | bb5f / 61 | lumpy distribution |

None of the categorical or binary variables have missing values. None of the enums are empty, though some of the enums (highlighted in red in Table 4 above) have fewer than 10 recorded instances, so they may bias the model. Unfortunately, we do not have any information about the meaning of the enum labels, so we do not have a good way to group the sparse enums into larger categories. If this were a real-world problem, we should take time to investigate this issue.
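Sparse enums like these can be flagged automatically by counting category frequencies against a threshold. A minimal sketch with toy labels (the values and the threshold of 2 are illustrative; the post uses a threshold of 10 on the real data):

```python
import pandas as pd

# Toy categorical column; the real plan_configuration has 9 enums
col = pd.Series(["a", "a", "a", "b", "b", "c"])

THRESHOLD = 2  # illustrative; the post flags enums with < 10 instances

# Frequency of each enum, then the labels below the threshold
counts = col.value_counts()
sparse = counts[counts < THRESHOLD].index.tolist()
```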

**Step 3: Normalize the numerical variables**

As can be seen in Figure 2 above, some of the independent numerical variables are skewed toward low values and have long tails. We can normalize the distributions by creating new variables using the log transform function. The resulting variables will have names prefixed with ln_. The descriptive statistics of the normalized variables are shown in Table 5 and Figure 3 below.

**Table 5.** Descriptive statistics for adjusted and normalized numerical parameters

| Variable name | Min | Median | Max | Mean | Std dev |
|---|---|---|---|---|---|
| ln_count_floors_pre_eq | 0 | 0.69 | 2.19 | 0.70 | 0.36 |
| ln_height | 0 | 1.61 | 3.40 | 1.46 | 0.40 |
| ln_adjust_age | 0 | 2.71 | 5.30 | 2.61 | 1.12 |
| ln_area | 1.79 | 3.53 | 6.05 | 3.54 | 0.46 |

**Figure 3**. Histograms of adjusted and normalized numerical parameters
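The transform itself is one natural-log call per column. A minimal sketch over the four numerical variables (the sample values are illustrative; note that the maximum area of 425 maps to ln(425) ≈ 6.05, the ln_area maximum in Table 5):

```python
import numpy as np
import pandas as pd

# Toy rows spanning each variable's observed range
df = pd.DataFrame({
    "count_floors_pre_eq": [1, 2, 9],
    "height": [1, 5, 30],
    "adjust_age": [1, 15, 200],
    "area": [6, 34, 425],
})

# Add an ln_-prefixed log-transformed copy of each column.
# adjust_age already maps zeros to 1, so log(0) never occurs.
for col in list(df.columns):
    df["ln_" + col] = np.log(df[col])
```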

This completes the first 3 steps of building a machine learning model. The remaining steps to solve this contest problem are posted at How to create a machine learning model – Part 2.