*Town of Barpak after Gorkha earthquake. Image from The Telegraph (UK)*

by George Taniwaki

This is the final set of my notes from a machine learning class offered by edX. Part 1 of this blog entry is posted in June 2018.

At the end of step 6, I discovered that none of my three models met the minimum F1 score (at least 0.60) needed to pass the class. Starting with the configuration shown in Figure 5, I modified my experiment by replacing the static data split with a partition-and-sample step that divides the data into 10 evenly sized folds. I used a random seed of 123 to ensure reproducibility.
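As a rough illustration of what a seeded 10-fold partition does, here is a minimal Python sketch. It is not how MAML Studio implements its partitioning, and the helper name `assign_folds` is hypothetical. A seed of 123 here will not reproduce the folds MAML produces; it only shows why a fixed seed makes the split repeatable.

```python
import random

def assign_folds(n_records, n_folds=10, seed=123):
    """Return a list giving the fold number (0..n_folds-1) of each record."""
    # Deal the fold numbers out evenly, then shuffle them with a fixed seed
    folds = [i % n_folds for i in range(n_records)]
    random.Random(seed).shuffle(folds)
    return folds

folds = assign_folds(10000)

# Each fold serves once as the hold-out set; the other nine train the model
for k in range(10):
    holdout = [i for i, f in enumerate(folds) if f == k]
    training = [i for i, f in enumerate(folds) if f != k]
```

Calling `assign_folds` twice with the same seed returns the same assignment, which is what makes repeated runs comparable.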

I added both a cross-validation step and a hyperparameter tuning step to optimize results. To improve performance, I added a Convert to indicator values module. This converts the categorical variables into dummy binary variables before running the model.
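The conversion to indicator values can be sketched in plain Python. This is only an illustration of the idea, not the MAML module itself; the function name `to_indicator_values` and the sample rows and labels are hypothetical.

```python
def to_indicator_values(records, column):
    """Replace one categorical column with 0/1 indicator columns."""
    # One new column per distinct level found in the data
    levels = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for level in levels:
            row[f"{column}_{level}"] = 1 if r[column] == level else 0
        encoded.append(row)
    return encoded

# Hypothetical rows; the category labels are placeholders
rows = [
    {"roof_type": "a", "age": 10},
    {"roof_type": "b", "age": 25},
    {"roof_type": "a", "age": 3},
]
encoded = to_indicator_values(rows, "roof_type")
```

A variable with many levels produces many columns, which is one reason a variable like geo_level_2 is expensive to include.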

Unfortunately, the MAML ordinal regression module does not support hyperparameter tuning. So I replaced it with the one-vs-all multiclass classifier. The new configuration is shown in Figure 6 below. (Much thanks to my classmate Robert Ritz for sharing his model.)

**Figure 6.** Layout of MAML Studio experiment with hyperparameter tuning

For an explanation of how hyperparameter tuning works, see Microsoft documentation and MSDN blog post.

In the earlier experiments, the two-class logistic regression classifier gave the best results. I will use it again with the one-vs-all multiclass model. The default parameter ranges for the two-class logistic regression classifier are: Optimization tolerance = 1E-4, 1E-7, L1 regularization weight = 0 .01, 0.1, 1.0, L2 regularization weight = 0.01, 0.1, 1.0, and memory size for L-BFGS = 5, 20, 50.

**Table 12a**. Truth table for one-vs-all multiclass model using logistic regression classifier

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 296 | 156 | 20 | 472 |

Predict 2 | 621 | 4633 | 1651 | 6905 |

Predict 3 | 21 | 847 | 1755 | 2623 |

TOTAL | 938 | 5636 | 3426 | 10000 |

**Table 12b**. Performance measures for one-vs-all multiclass model using logistic regression classifier

Performance | Value |

Avg Accuracy | 0.76 |

F1 Score | 0.64 |

F1 Score (test data) | Not submitted |

The result is disappointing. The new model has an F1 score of 0.64, which is lower than the F1 score of the ordinal regression model using the logistic regression classifier.

Originally, I excluded geo_level_2 from the model, even though its Chi-square test was significant, because it consumed too many degrees of freedom. I reran the experiment with this variable included, keeping all other variables and parameters the same.

**Table 13a**. Truth table for one-vs-all multiclass model using logistic regression classifier and including geo_level_2

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 355 | 218 | 27 | 600 |

Predict 2 | 564 | 4662 | 1446 | 6672 |

Predict 3 | 19 | 756 | 1953 | 2728 |

TOTAL | 938 | 5636 | 3426 | 10000 |

**Table 13b**. Performance measures for one-vs-all multiclass model using logistic regression classifier and including geo_level_2

Performance | Value |

Avg Accuracy | 0.80 |

F1 Score | 0.70 |

F1 Score (test data) | Not submitted |

The resulting F1 score on the training data is 0.70, which is better than any prior experiment and exactly meets our target of 0.70.

I will try to improve the model by adding a variable measuring height/floor. This variable is always positive, skewed toward zero and has a long tail. To normalize it, I apply the natural log transform and name the variable ln_height_per_floor. Table 14 and Figure 7 show the summary statistics.
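The new variable can be sketched as a one-line calculation. The function below is an illustration only; its name mirrors the variable name, and it assumes both inputs are at least 1 so the ratio is positive.

```python
import math

def ln_height_per_floor(height, count_floors_pre_eq):
    """Natural log of the average height of one floor."""
    # Both inputs are assumed to be >= 1, so the ratio is always positive
    return math.log(height / count_floors_pre_eq)

# Hypothetical building: height of 4 spread over 2 floors
example = ln_height_per_floor(4, 2)
```

A building with a height of 4 and 2 floors gives ln(2), or about 0.69, which matches the median reported in Table 14.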

**Table 14.** Descriptive statistics for ln_height_per_floor

Variable name | Min | Median | Max | Mean | Std dev |

ln_height_per_floor | -1.79 | 0.69 | 2.30 | 0.76 | 0.25 |

**Figure 7.** Histogram of ln_height_per_floor

I run the model again with no other changes.

**Table 15a**. Truth table for one-vs-all multiclass model using logistic regression classifier, including geo_level_2, height/floor

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 366 | 227 | 28 | 621 |

Predict 2 | 557 | 4640 | 1436 | 6633 |

Predict 3 | 15 | 769 | 1962 | 2746 |

TOTAL | 938 | 5636 | 3426 | 10000 |

**Table 15b**. Performance measures for one-vs-all multiclass model using logistic regression classifier, including geo_level_2, height/floor

Performance | Value |

Avg Accuracy | 0.80 |

F1 Score | 0.70 |

F1 Score (test data) | Not submitted |

The accuracy of predicting damage_grade = 1 or 3 increases, but the accuracy of predicting 2 decreases, resulting in no change in average accuracy or the F1 score.

The accuracy of the one-vs-all multiclass model was significantly improved by adding geo_level_2. Let’s see what happens if I add this variable to the ordinal regression model which produced a higher F1 score than the one-vs-all model.

**Table 16a**. Truth table for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 80 | 59 | 1 | 140 |

Predict 2 | 227 | 1557 | 542 | 2326 |

Predict 3 | 3 | 246 | 585 | 834 |

TOTAL | 310 | 1862 | 1128 | 3300 |

**Table 16b**. Performance measures for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor

Performance | Value |

Avg Accuracy | 0.78 |

F1 Score | 0.67 |

F1 Score (test data) | Not submitted |

Surprisingly, ordinal regression produces worse results when the geo_level_2 variable is included than without it.

I spent a lot of effort adjusting and normalizing my numeric variables. They were mostly integer values with small range and did not appear to be correlated to damage_grade. Could the model be improved by treating them as categorical? Let’s find out.

First, I perform a Chi-square test to confirm that all of the variables are significant. Then I run the model after converting all the values from numeric to strings and all the variables from numeric to categorical.

**Table 17**. Chi-square results of numerical values to damage_grade

Variable name | Chi-square | Deg. of freedom | P value |

count_floors_pre_eq | 495 | 14 | < 2.2e-16* |

height | 367 | 37 | < 2.2e-16* |

age | 690 | 60 | < 2.2e-16* |

area | 738 | 314 | < 2.2e-16* |

count_families | 76 | 14 | 1.3e-10* |

count_superstructure | 104 | 14 | 7.1e-16* |

count_secondary_use | 79 | 4 | 3.6e-16* |

*One or more enums have sample sizes too small to use Chi-square approximation

[ ] P value greater than 0.05 significance level

**Table 18a**. Truth table for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor, and converting numeric to categorical

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 83 | 62 | 3 | 148 |

Predict 2 | 224 | 1544 | 540 | 2308 |

Predict 3 | 3 | 256 | 585 | 844 |

TOTAL | 310 | 1862 | 1128 | 3300 |

**Table 18b**. Performance measures for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor, and converting numeric to categorical

Performance | Value |

Avg Accuracy | 0.78 |

F1 Score | 0.67 |

F1 Score (test data) | Not submitted |

Changing the integer variables to categorical has almost no impact on the F1 score.

Table 19 below summarizes all nine models I built. Seven of them achieved an F1 score of 0.60 or higher on the training data, which would probably have been sufficient to pass the class. Two of them had an F1 score of 0.70, which would be a grade of 95 out of 100.

I was unable to run most of these models on the test dataset and submit the results to the data science capstone website. Thus, I do not know what my leaderboard F1 score would be. It is possible that I overfit my model to the training data and my leaderboard F1 score might be lower.

Finding the best combination of variables, models, and model hyperparameters is difficult to do manually. It took me several hours to build the nine models described in this blog post. Machine learning automation tools exist but are not yet robust, nor built into platforms like MAML Studio. (Much thanks again to Robert Ritz who pointed me to TPOT, a Python-based tool for auto ML.)

**Table 19.** Summary of models. Green indicates differences from base case, model 2

Model | Variables | Algorithm | Training data | F1 score (test data) |

1 | None | Naïve guess = 2 | None | 0.56 |

3 | 27 from Table 5 | Ordinal regression with decision forest | 0.67 split | 0.64 |

4 | 27 from Table 5 | Ordinal regression with SVM | 0.67 split | 0.57 (0.5644) |

2 | 27 from Table 5 | Ordinal regression with logistic regression | 0.67 split | 0.68 (0.5687) |

5 | 27 from Table 5 | One-vs-all multiclass with logistic regression, hyperparameter tuning | 10-fold partition | 0.64 |

6 | 27 from Table 5, geo_level_2 | One-vs-all multiclass with logistic regression, hyperparameter tuning | 10-fold partition | 0.70 |

7 | 27 from Table 5, geo_level_2, height/floor | One-vs-all multiclass with logistic regression, hyperparameter tuning | 10-fold partition | 0.70 |

8 | 27 from Table 5, geo_level_2, height/floor | Ordinal regression with logistic regression | 0.67 split | 0.67 |

9 | 27 from Table 5, convert numeric to categorical, geo_level_2, height/floor | Ordinal regression with logistic regression | 0.67 split | 0.67 |

*Damage caused by Gorkha earthquake. Image by Prakash Mathema/AFP/Getty Images*

by George Taniwaki

This is a continuation of my notes for a machine learning class offered by edX. Part 1 of this blog entry is posted in June 2018.

Pairwise scatterplots of the numerical variables after adjusting and normalizing are shown in Figure 4 below. The dependent variable (damage_grade) does not appear to be correlated with any of the numerical independent variables. Despite the lack of correlation, I included all the numeric variables when building the model. If I have time, I will convert these numerical variables into categorical ones.

Among the independent variables, covariance is highest between count_floors_pre_eq and height (highlighted in green). This makes sense: taller buildings are likely to have more floors. If I have time, I will add a new variable height_per_floor (= height / count_floors_pre_eq).

**Figure 4.** Pairwise scatterplots of all numerical parameters. Correlations highlighted in green

There are 11 categorical and 23 binary parameters. I used the Chi-square test to compare distributions of these to the distribution of the dependent variable damage_grade, treated as categorical. The results are shown in Table 5 below.

All are statistically significant at 0.05 level, except for the 5 highlighted in red brackets. They will be excluded from the model. Two of the geo_level variables consume too many degrees of freedom given our sample size. (Even big datasets have limitations.) They are highlighted in orange and will be excluded from the model. If I have time, I will add a new custom geo_level variable with about 1000 degrees of freedom. The remaining 27 variables will be retained for use in the model.

**Table 5.** Chi-square results of categorical and Boolean values to damage_grade

Variable name | Chi-square | Deg. of freedom | P value |

geo_level_1_id | 2746 | 60 | < 2.2e-16* |

geo_level_2_id | 6592 | 2272 | < 2.2e-16* |

geo_level_3_id | 13039 | 10342 | < 2.2e-16* |

land_surface_condition | 15.9 | 4 | 0.0032 |

foundation_type | 1857 | 8 | < 2.2e-16* |

roof_type | 1122 | 4 | < 2.2e-16 |

ground_floor_type | 1347 | 8 | < 2.2e-16* |

other_floor_type | 1117 | 6 | < 2.2e-16 |

position | 47.3 | 6 | 1.6e-08 |

plan_configuration | 46.5 | 16 | 8.0e-05* |

legal_ownership_status | 80.2 | 6 | 3.2e-15 |

has_superstructure_adobe_mud | 50.3 | 2 | 1.1e-11 |

has_superstructure_mud_mortar_stone | 1053 | 2 | < 2.2e-16 |

has_superstructure_stone_flag | 28.6 | 2 | 6.2e-07 |

has_superstructure_cement_mortar_stone | 53.8 | 2 | 2.1e-12 |

has_superstructure_mud_mortar_brick | 37.5 | 2 | 7.3e-09 |

has_superstructure_cement_mortar_brick | 632 | 2 | < 2.2e-16 |

has_superstructure_timber | 64.9 | 2 | 8.0e-15 |

has_superstructure_bamboo | 55.1 | 2 | 1.1e-12 |

has_superstructure_rc_non_engineered | 296 | 2 | < 2.2e-16 |

has_superstructure_rc_engineered | 603 | 2 | < 2.2e-16* |

has_superstructure_other | 9.8 | 2 | 0.0072 |

has_secondary_use | 76.2 | 2 | < 2.2e-16 |

has_secondary_use_agriculture | 25.0 | 2 | 3.8e-06 |

has_secondary_use_hotel | 90.9 | 2 | < 2.2e-16 |

has_secondary_use_rental | 49.4 | 2 | 1.9e-11 |

has_secondary_use_institution | 10.8 | 2 | 0.0046* |

has_secondary_use_school | 32.1 | 2 | 1.1e-07* |

has_secondary_use_industry | 2.3 | 2 | [ 0.31 ]* |

has_secondary_use_health_post | 1.5 | 2 | [ 0.46 ]* |

has_secondary_use_gov_office | 4.2 | 2 | [ 0.12 ]* |

has_secondary_use_use_police | 0.8 | 2 | [ 0.68 ]* |

has_secondary_use_other | 15.3 | 2 | 0.00049* |

has_missing_age | 3.1 | 2 | [ 0.21 ]* |

*One or more enums have sample sizes too small to use Chi-square approximation

[ ] P value greater than 0.05 significance level
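For readers who want to see where the Chi-square and degrees-of-freedom columns in Table 5 come from, here is a minimal sketch of the test of independence in plain Python. The function names and the observed counts are hypothetical, and the P value helper only covers the special case of 2 degrees of freedom, where the formula has a simple closed form. A statistics package would normally handle the general case.

```python
import math

def chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def p_value_df2(statistic):
    """P value for exactly 2 degrees of freedom (closed form)."""
    return math.exp(-statistic / 2)

# Hypothetical counts: rows are the two values of a binary variable,
# columns are damage_grade 1, 2, 3
observed = [[250, 1700, 1100], [60, 162, 28]]
statistic, dof = chi_square(observed)
```

The degrees of freedom grow with the number of categories: a variable with 1,137 enums crossed with 3 damage grades has (1137 - 1) × (3 - 1) = 2272 degrees of freedom, which is the value shown for geo_level_2_id. The sketch assumes no row or column total is zero.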

The dependent variable can take on the value 1, 2, or 3. I could use a classification method like multi-class logistic regression to create our model. However, there is a better way. I will use the ordinal regression algorithm available from Microsoft Azure Machine Learning (MAML).

Ordinal regression requires a binary classifier method. For this project, I will try three classifiers available in MAML, two-class logistic regression, two-class decision forest, and two-class support vector machine (SVM). I will use the default parameters for each classifier and submit all three models as entries in the contest. (This is sort of a cheat to improve the F1 score and my grade. In practice, you should only submit the results using the best model based on the training data.)

A simple model configuration using a static data split and without either a cross-validation step or a hyperparameter tuning step to optimize results is shown in Figure 5 below. I will add these steps later if the simple model does not meet the performance goals.

**Figure 5.** Layout of a simple MAML Studio experiment

I used a data split of 0.67, meaning 6,700 records were used to train the model and the remaining 3,300 were used to score and evaluate it. I used a random number seed of 123 to ensure every run of my experiment used the same split and produced replicable results.

A generic truth table for an experiment with 3 outcomes is shown in Table 6a below. Using data from the truth table, four performance metrics can be calculated: average accuracy, micro-average precision, micro-average recall, and F1 score (the harmonic mean of precision and recall). The calculation of the performance metrics is shown in Table 6b. Note that because every record is assigned to exactly one class, each misclassification counts as one false positive and one false negative, so micro-average recall, precision, and F1 score are all equal. Perhaps the contest should have used the macro-average recall and precision to calculate the F1 score.

**Table 6a.** Generic truth table for case with 3 outcomes

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | TP1, TN2, TN3 | FP1, FN2, TN3 | FP1, TN2, FN3 | TP1+FP1 |

Predict 2 | FN1, FP2, TN3 | TN1, TP2, TN3 | TN1, FP2, FN3 | TP2+FP2 |

Predict 3 | FN1, TN2, FP3 | TN1, FN2, FP3 | TN1, TN2, TP3 | TP3+FP3 |

TOTAL | TP1+FN1 | TP2+FN2 | TP3+FN3 | Pop |

**Table 6b.** Performance measures for case with 3 outcomes

Performance | Calculation |

Avg Accuracy | ∑((TP + TN) / Pop) / 3 |

Avg Precision (micro) | P = ∑TP / ∑(TP + FP) = ∑TP / Pop |

Avg Recall (micro) | R = ∑TP / ∑(TP + FN) = ∑TP / Pop |

F1 Score | (2 * P * R) / (P + R) = ∑TP / Pop |
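The calculations in Table 6b can be checked with a short Python sketch. The function name `performance` is hypothetical, and the matrix layout (rows are predictions, columns are actual values) follows Table 6a. The counts used here are the ones reported in Table 9a for the logistic regression model.

```python
def performance(matrix):
    """matrix[p][a] = number of records predicted as class p that are class a."""
    n = len(matrix)
    pop = sum(sum(row) for row in matrix)
    tp = [matrix[k][k] for k in range(n)]
    # False positives: predicted k but actually something else (rest of row k)
    fp = [sum(matrix[k]) - tp[k] for k in range(n)]
    # False negatives: actually k but predicted something else (rest of column k)
    fn = [sum(matrix[p][k] for p in range(n)) - tp[k] for k in range(n)]
    tn = [pop - tp[k] - fp[k] - fn[k] for k in range(n)]
    avg_accuracy = sum((tp[k] + tn[k]) / pop for k in range(n)) / n
    precision = sum(tp) / (sum(tp) + sum(fp))
    recall = sum(tp) / (sum(tp) + sum(fn))
    f1 = 2 * precision * recall / (precision + recall)
    return avg_accuracy, precision, recall, f1

# Counts from Table 9a
table_9a = [[84, 52, 1], [223, 1567, 536], [3, 243, 591]]
avg_accuracy, precision, recall, f1 = performance(table_9a)
```

Rounded to two decimals, this gives an average accuracy of 0.79 and an F1 score of 0.68, with precision and recall equal to each other. The sketch assumes at least one correct prediction, otherwise the F1 formula divides by zero.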

Grading for the course is based on the F1 score for a hidden subset of the test dataset as shown in Table 7 below. F1 scores between these points will receive linearly proportional grades. For instance, an F1 score of 0.65 would earn a grade of 75.

**Table 7.** Grading of project based on F1 score (test data)

F1 Score | Grade |

< 0.60 | 1 out of 100 |

0.60 | 60/100 |

0.64 | 70/100 |

0.66 | 80/100 |

0.70 | 95/100 |
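The grading rule amounts to linear interpolation between the points in Table 7. Here is a minimal sketch; the function name `grade` is hypothetical. Table 7 does not say what happens above 0.70, so returning 95 for higher scores is an assumption.

```python
GRADE_POINTS = [(0.60, 60), (0.64, 70), (0.66, 80), (0.70, 95)]

def grade(f1):
    """Grade out of 100 for a given F1 score, per Table 7."""
    if f1 < GRADE_POINTS[0][0]:
        return 1.0
    # Interpolate linearly between the two neighboring points
    for (x0, y0), (x1, y1) in zip(GRADE_POINTS, GRADE_POINTS[1:]):
        if f1 <= x1:
            return y0 + (y1 - y0) * (f1 - x0) / (x1 - x0)
    # Assumption: scores above the last point get the last grade
    return float(GRADE_POINTS[-1][1])
```

For example, `grade(0.65)` falls halfway between the 70 and 80 points and returns 75 (up to floating-point rounding).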

Below are the results of my models.

The most common value for damage_grade is 2. Any prediction model should perform better than the naïve guess of predicting the damage is 2 for all buildings.
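The naive baseline takes only a few lines to compute. In this sketch the function name `naive_guess` is hypothetical, and the label counts are the actual-class totals of the 3,300-record evaluation split shown in the truth tables.

```python
from collections import Counter

def naive_guess(labels):
    """Return the most common label and the share of records it covers."""
    value, count = Counter(labels).most_common(1)[0]
    return value, count / len(labels)

# Actual damage_grade counts in the 3,300-record evaluation split
labels = [1] * 310 + [2] * 1862 + [3] * 1128
guess, share = naive_guess(labels)
```

Because every record is assigned exactly one class, the share of correct guesses (about 0.56) is also the micro-average F1 score of the baseline.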

**Table 8a**. Truth table for naive guess of damage_grade = 2

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 0 | 0 | 0 | 0 |

Predict 2 | 310 | 1862 | 1128 | 3300 |

Predict 3 | 0 | 0 | 0 | 0 |

TOTAL | 310 | 1862 | 1128 | 3300 |

**Table 8b**. Performance measures for naive guess of damage_grade = 2

Performance | Value |

Avg Accuracy | 0.71 |

F1 Score | 0.56 |

F1 Score (test data) | Not submitted |

The default parameters for the 2-class logistic regression classifier are: Optimization tolerance = 1E-07, L1 regularization weight = 1, L2 regularization weight = 1. Notice in Table 9b the large gap between the F1 score using the training data and the test data. This indicates the model is overfitted.

**Table 9a**. Truth table for ordinal regression model using 2-class logistic regression classifier

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 84 | 52 | 1 | 137 |

Predict 2 | 223 | 1567 | 536 | 2326 |

Predict 3 | 3 | 243 | 591 | 837 |

TOTAL | 310 | 1862 | 1128 | 3300 |

**Table 9b**. Performance measures for ordinal regression model using 2-class logistic regression classifier

Performance | Value |

Avg Accuracy | 0.79 |

F1 Score | 0.68 |

F1 Score (test data) | 0.5687 |

The default parameter settings for the two-class decision forest classifier are: Resampling method = Bagging, Trainer mode = Single parameter, Number of decision trees = 8, Maximum depth of trees = 32, Number of random splits per node = 128, Minimum number of samples per node = 1.

**Table 10a**. Truth table for ordinal regression model using decision forest classifier

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 85 | 55 | 4 | 144 |

Predict 2 | 218 | 1335 | 432 | 1985 |

Predict 3 | 7 | 472 | 692 | 1171 |

TOTAL | 310 | 1862 | 1128 | 3300 |

**Table 10b**. Performance measures for ordinal regression model using decision forest classifier

Performance | Value |

Avg Accuracy | 0.76 |

F1 Score | 0.64 |

F1 Score (test data) | Not submitted |

The default parameter settings for the two-class SVM classifier are: Number of iterations = 1, Lambda = 0.001.

**Table 11a**. Truth table for ordinal regression model using SVM classifier

Truth table | Is 1 | Is 2 | Is 3 | TOTAL |

Predict 1 | 31 | 21 | 3 | 55 |

Predict 2 | 273 | 1667 | 932 | 2872 |

Predict 3 | 6 | 174 | 193 | 373 |

TOTAL | 310 | 1862 | 1128 | 3300 |

**Table 11b**. Performance measures for ordinal regression model using SVM classifier

Performance | Value |

Avg Accuracy | 0.72 |

F1 Score | 0.57 |

F1 Score (test data) | 0.5644 |

Unfortunately, none of my initial models performed well. The F1 score never meets the target of 0.70. In fact, in some cases my models don’t do much better than just guessing. (Note: You can see my scores on the contest leaderboard.) In the next section, we will add a cross-validation step and a hyperparameter tuning step to optimize the models and see if that improves them.

This completes steps 4 to 6 of building a machine learning model. The remaining optimization step and the results are posted at How to create a machine learning model – Part 3.

*Damage caused by Gorkha earthquake. Image from The Guardian*

by George Taniwaki

At the beginning of each quarter edX and Microsoft offer a one-month long course called DAT 102x, Data Science Capstone. The class consists of a single machine learning project using real-world data. The class this past April used data collected by the Nepal government after the Gorkha earthquake in 2015. The earthquake killed nearly 9,000 people and left over 100,000 homeless.

The assignment was to predict damage level for individual buildings based on building characteristics such as age, height, location, construction materials, and use type. Below are the steps I used to solve this problem. The solution is general enough to apply to any machine learning problem. My description is a bit lengthy but shows the iterative nature of tuning a machine learning model.

**About machine learning contests**

The class project is operated as a contest. Students download training and test datasets, create a model using the training dataset, use the model to make predictions for the test dataset, submit their predictions, and see their scored results on a leaderboard.

As is common for machine learning contests, the training data consists of two files. The first file includes a large number of records (for the earthquake project, there were 10,000). Each record consists of an index and the independent variables (38 building parameters). A separate label file contains only two columns, the index and the associated dependent variable(s) (in the earthquake project, there is only one, called damage_grade). The two files must be joined before creating a model.
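Joining the two files on the index can be sketched with the Python standard library. The index column name `building_id`, the helper name `join_on_index`, and the sample rows below are all hypothetical; the real files have 38 parameter columns.

```python
import csv
import io

# Hypothetical file contents; real files would be opened from disk
values_csv = "building_id,age,height\n101,15,5\n102,40,3\n"
labels_csv = "building_id,damage_grade\n102,3\n101,2\n"

def join_on_index(values_file, labels_file, index="building_id"):
    """Attach each record's label columns to its parameter columns."""
    labels = {row[index]: row for row in csv.DictReader(labels_file)}
    joined = []
    for row in csv.DictReader(values_file):
        merged = dict(row)
        # Raises KeyError if a record has no matching label
        merged.update(labels[row[index]])
        joined.append(merged)
    return joined

training = join_on_index(io.StringIO(values_csv), io.StringIO(labels_csv))
```

Looking labels up by index, rather than relying on row order, keeps the join correct even when the two files are sorted differently.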

The test file (10,000 more records of building indexes and parameters) has the same format as the training file. You use your model to create an output file consisting of the index and the predicted values of the dependent variable(s). You submit the file to a web service which then scores your results.

You can submit multiple predictions to be scored, adjusting your model after each submission in an attempt to improve your score. Your final score is based on the highest score achieved. To reduce the chance that competitors (students) overfit their model to the test data, the score is based on an undisclosed subset of records in the test file.

**Approach to model building**

The general approach to building a machine learning model is to first examine the dependent variables using univariate methods (step 1). Repeat for the independent variables (step 2). Normalize the variables (step 3). Examine correlations using multivariate methods (step 4). Select the relevant variables, choose a model, and build it (step 5). Evaluate and test the model (step 6) and tune the parameters (step 7) to get the best fit without overfitting. Some iteration may be required.

**Step 1: Univariate statistics for the dependent variable**

There is one dependent variable, damage_grade, labeled with an integer from 1 to 3. Higher values mean worse damage. However, the intervals between each class are not constant, so the scale is ordinal not interval. Descriptive statistics are shown in Table 1 and Figure 1 below.

**Table 1**. Descriptive statistics for dependent variable

Variable name | Min | Median | Max | Mean | Std dev |

damage_grade | 1 | 2 | 3 | 2.25 | 0.61 |

**Figure 1.** Histogram of dependent variable

**Step 2: Univariate statistics for the independent variables**

As mentioned above, there are 38 building parameters. Details of the variables are given at Data Science Capstone. The 38 independent variables can be divided into 4 classes, binary, interval integer, float, and categorical as shown in Table 2 below. Notice that the parameter count_families is defined as a float even though it only takes on integer values.

**Table 2.** Summary of independent variables

Variable type | Quantity | Examples |

Binary (Boolean) | 22 | has_superstructure_XXXX, has_secondary_use, has_secondary_use_XXXX |

Integer, interval | 4 | count_floors_pre_eq, height, age, area |

Float | 1 | count_families |

Categorical | 11 | geo_level_1_id, geo_level_2_id, geo_level_3_id, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, legal_ownership_status |

The binary variables fall into three groups. First is has_superstructure_XXXX, where XXXX can be 11 possible materials used to produce the building superstructure such as adobe mud, timber, bamboo, etc. The second is has_secondary_use_XXXX, where XXXX can be 10 possible secondary uses for the building such as agriculture, hotel, school, etc. Finally, has_secondary_use indicates whether any of the has_secondary_use_XXXX variables is true.

Whenever I have groups of binary variables, I like to create new interval integer variables based on them. In this case, they are named count_superstructure and count_secondary_use which are a count of the number of true values for each. count_superstructure can vary from 0 to 11 while count_secondary_use can vary from 0 to 10.
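The two count variables can be derived by summing the binary columns that share a prefix. This is a minimal sketch; the helper name `count_true` and the sample record are hypothetical.

```python
def count_true(record, prefix):
    """Count the binary fields whose names start with prefix and equal 1."""
    return sum(value for name, value in record.items() if name.startswith(prefix))

# Hypothetical record showing only a few of the binary columns
record = {
    "has_superstructure_timber": 1,
    "has_superstructure_bamboo": 1,
    "has_superstructure_adobe_mud": 0,
    "has_secondary_use": 1,
    "has_secondary_use_hotel": 1,
    "has_secondary_use_rental": 0,
}
count_superstructure = count_true(record, "has_superstructure_")
count_secondary_use = count_true(record, "has_secondary_use_")
```

The trailing underscore in the second prefix matters: it keeps the summary flag has_secondary_use out of the count.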

For the 7 numerical parameters, their minimum, median, maximum, mean, standard deviation, and histogram are shown in Table 3 and Figure 2 below. Possible outliers, which occur in all 7 numerical variables, are highlighted in red.

**Table 3.** Descriptive statistics for the 7 numerical variables. Red indicates possible outliers

Variable name | Min | Median | Max | Mean | Std dev |

count_floors_pre_eq | 1 | 2 | 9 | 2.15 | 0.74 |

height | 1 | 5 | 30 | 4.65 | 1.79 |

age | 0 | 15 | 995 | 25.39 | 64.48 |

area | 6 | 34 | 425 | 38.44 | 21.27 |

count_families | 0 | 1 | 7 | 0.98 | 0.42 |

count_superstructure | 1 | 1 | 8 | 1.45 | 0.78 |

count_secondary_use | 0 | 0 | 2 | 0.11 | 0.32 |

**Figure 2.** Histograms for the 7 numerical variables. Red indicates possible outliers

Upon inspection of the dataset, none of the numerical variables have missing values. However, it appears that for 40 records, age has an outlier value of 995. The next highest value is 200. Further, the lowest age value is zero, which does not work well with the log transform. To clean the data, I created a new binary variable named has_missing_age, with value = 1 if age = 995, else 0. I also created a new numerical variable named adjust_age, with value = 1 if age = 0, else 15 (the median) if age = 995, else age.
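The age cleaning rules translate directly into code. In this sketch the function name `clean_age` and the constant names are hypothetical; the values 995 and 15 come from the text above.

```python
MISSING_AGE_CODE = 995  # outlier value treated as a missing-age marker
MEDIAN_AGE = 15         # median of age, from Table 3

def clean_age(age):
    """Return (has_missing_age, adjust_age) for one record."""
    has_missing_age = 1 if age == MISSING_AGE_CODE else 0
    if age == 0:
        adjust_age = 1  # avoids taking the log of zero later
    elif age == MISSING_AGE_CODE:
        adjust_age = MEDIAN_AGE
    else:
        adjust_age = age
    return has_missing_age, adjust_age
```

Keeping has_missing_age as a separate flag preserves the information that the age was unknown even after the value is replaced by the median.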

The variable area has a wide range, but does not appear to contain any outliers.

The variables count_families, count_superstructure, and count_secondary_use do not seem to have any outliers. They also do not need to be normalized.

For the 11 categorical variables, the number of categories (enums), and the names of the enums with the largest and smallest counts are shown in Table 4 below.

**Table 4**. Descriptive statistics for the 11 categorical variables. Red indicates one or more enums has fewer than 10 recorded instances

Variable name | Count enums | Max label / count | Min label / count | Comments |

geo_level_1_id | 31 | 0 / 903 | 30 / 7 | hierarchical |

geo_level_2_id | 1137 | 0 / 157 | tie / 1 | hierarchical |

geo_level_3_id | 5172 | 7 / 24 | tie / 1 | hierarchical |

land_surface_condition | 3 | d502 / 8311 | 2f15 / 347 | |

foundation_type | 5 | 337f / 8489 | bb5f / 48 | |

roof_type | 3 | 7g76 / 7007 | 67f9 / 579 | |

ground_floor_type | 5 | b1b4 / 8118 | bb5f / 10 | |

other_floor_type | 4 | f962 / 6412 | 67f9 / 415 | correlated with ground_floor_type |

position | 4 | 3356 / 7792 | bcab / 477 | |

plan_configuration | 9 | a779 / 9603 | cb88 / 1 | lumpy distribution |

legal_ownership_status | 4 | c8e1 / 9627 | bb5f / 61 | lumpy distribution |

None of the categorical or binary variables have missing values. None of the enums are empty, though some of the enums (highlighted in red in Table 4 above) have fewer than 10 recorded instances, so they may bias the model. Unfortunately, we do not have any information about the meaning of the enum labels, so we do not have a good way to group the sparse enums into larger categories. If this were a real-world problem I would take time to investigate this issue.

**Step 3: Normalize the numerical variables**

As can be seen in Figure 2 above, some of the independent numerical variables are skewed toward low values and have long tails. I normalized the distributions by creating new variables using the log transform function. The resulting variables have names prefixed with ln_. The descriptive statistics of the normalized variables are shown in Table 5 and Figure 3 below.

**Table 5.** Descriptive statistics for adjusted and normalized numerical parameters

Variable name | Min | Median | Max | Mean | Std dev |

ln_count_floors_pre_eq | 0 | 0.69 | 2.19 | 0.70 | 0.36 |

ln_height | 0 | 1.61 | 3.40 | 1.46 | 0.40 |

ln_adjust_age | 0 | 2.71 | 5.30 | 2.61 | 1.12 |

ln_area | 1.79 | 3.53 | 6.05 | 3.54 | 0.46 |

**Figure 3**. Histograms of adjusted and normalized numerical parameters

This completes the first 3 steps of building a machine learning model. Steps 4 to 6 to solve this contest problem are posted at How to create a machine learning model – Part 2. The final optimization step 7 is posted at Part 3.

by George Taniwaki

While working toward my Microsoft Data Science Certificate (see Jul 2017 blog post), I also completed the Microsoft Excel for the Data Analyst XSeries Program sponsored by Microsoft and edX.

There are 3 classes in the program, two of which were also courses for the Microsoft Data Science Certificate.

This basic course on data analysis using Excel covers pivot tables, using SUMIF() and SUMIFS() functions to create dashboards and year-over-year comparison tables (something that is not possible using pivot tables alone), creating reports with hierarchical data, using Power Pivot, and creating multi-table reports using the data model and the More tables… feature.

Year-over-year comparison tables can also be created using the Excel data model and time intelligence functions. These are covered in the third course, DAT 206x.

Time: Since I am already an experienced Excel user, I skipped the videos and just did the homework. I covered the 8 modules in about 6 hours.

Score: I missed 1 quiz question and no lab questions for a combined score of 99%

[I took DAT222x in order to earn the Microsoft Data Science Certificate. This section is copied from this Jul 2017 blog post.]

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes rule, random variables, central limit theorem, sampling and confidence interval, and hypothesis testing. Most analysis is conducted using the Data analysis pack add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed 9 questions on the quizzes (88%) and six in the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade)

Topics include importing data and using queries with Excel, the Excel data model, using the M query language and DAX query language, creating dashboards and visualizations, and using Excel with Power BI.

Within the M language, topics include using the functions in the ribbon, such as filtering rows, and the Table.Unpivot function.

Within the DAX language, topics include using the X functions like SUMX() and the CALCULATE() function, using a Calendar table and time intelligence, and customizing pivot tables and pivot charts using the CUBE functions from the multidimensional expressions (MDX) language. The CUBE functions can also generate a table that can be used to create chart types that Excel does not support directly from a pivot table (for instance the new treemap, sunburst, and histogram charts).

Time: I am an experienced Excel user, but some of the advanced DAX functions were new to me. 6 hours for 8 modules

Score: I got a bit sloppy. I missed 2 lab questions and 1 quiz question for a combined score of 95%

Below is my certificate of completion for the Microsoft Excel for the Data Analyst XSeries Program.

by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist? And which software stack should you specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to take an online program developed and sponsored by Microsoft and edX. Completion of the program leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced at about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about halfway done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

I took a T-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced MOOC. This course starts with the basics like select, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

I already have experience creating reports using Power BI. I also use Power Query (now called Get and Transform Data) with the M language and Power Pivot with the DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes’ rule, random variables, the central limit theorem, sampling and confidence intervals, and hypothesis testing. Most analysis is conducted using the Analysis ToolPak add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed nine questions on the quizzes (88%) and six in the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade.)

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, it goes to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure subject mastery by a beginning programmer. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

The first three modules in this course covered statistics and were mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate it using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

I have now completed six of the ten courses required for a certificate. I expect to finish the remaining four by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.

I’m a libertarian by nature. (That’s libertarian with a small L, meaning I believe in government transparency and clarity. Please don’t confuse it with Libertarian with a capital L, which I associate with mindless anarchy.) Every two years, I dutifully check for my ballot and voter pamphlet (Washington has vote by mail). The list of items seems to be getting longer, especially voter initiatives.

Here is my method of deciding how to cast my ballot on voter initiatives. First, I start skeptically. Most voter initiatives are funded by political extremists who do not consider the consequences of adopting their pet idea. But I do my online research, checking analysis produced by hopefully reputable and unbiased sources. Ultimately though, I usually vote against them.

This year in Washington, there is a really bizarre ballot issue. It is Initiative Measure No. 1501, “Increased Penalties for Crimes Against Vulnerable Individuals.”

This measure would increase the penalties for criminal identity theft and civil consumer fraud targeted at seniors or vulnerable individuals; and exempt certain information of vulnerable individuals and in-home caregivers from public disclosure.

Should this measure be enacted into law? Yes [ ] No [ ]

How could anyone be against this? We want to help seniors, right? Well, it’s not that simple.

There is a very complex story behind this initiative. It involves a union, an antiunion think tank, and the U.S. Supreme Court. Initiative 1501 is sponsored by the Service Employees International Union (SEIU), which represents healthcare workers who work in nursing homes or provide in-home care. Washington, like most states, requires certain workers, such as nurses, to have a license in order to provide services to the public. About one-third of all service workers in the U.S. require licenses. In many cases, these workers are also unionized.

Enter the Freedom Foundation. This antiunion policy group is headquartered in Olympia, Washington. It was founded by Bob Williams, who was formerly with the American Legislative Exchange Council (ALEC). You may have heard of ALEC; it is a corporate-funded lobbying group that writes model legislation (which obviously is designed to further the goals of its corporate clients) and provides it to state legislators to review. The legislators can then submit the bills for approval into law. The Freedom Foundation provides very similar services.

In 2014, the U.S. Supreme Court ruled 5-4 in *Harris v. Quinn* that an Illinois state law that allowed the SEIU to collect a representation fee (union dues) from in-home healthcare workers’ wages was unconstitutional. The reasoning was that the fee violated the First Amendment rights of the workers to not provide financial support for collective bargaining.

After the ruling, the Freedom Foundation complained that the SEIU was not doing enough to inform its members that they did not have to pay the representation fee in order to belong to the union. Through a public records act, it sued the union and the state, won, and started to send communications to members encouraging them to stop paying the fee.

Since a Supreme Court ruling covers the entire U.S., not just Illinois, the SEIU realized that it was very vulnerable to attack by the Freedom Foundation or other antiunion organizations.

In Washington, the SEIU proactively sponsored Initiative 1501 as a direct attack against the Freedom Foundation. The SEIU wants to avoid having to release the names, addresses, and phone numbers of its members (or having the state reveal them). Initiative 1501 does this by declaring in-home caregivers a protected class, like seniors or vulnerable individuals, whose personal information the state and the union cannot release.

After all that research, the story starts to make sense. This is a battle between two parties that a libertarian like me dislikes. But more transparency is better than less. So I will vote no. Sorry seniors and vulnerable individuals, you will have to rely on existing statutes to protect you.

Did you watch the debate on Monday night? I did. But I am also very interested in the post-debate media coverage and analysis. This morning, two articles that combine big data and the debate caught my eye. Both are novel and much more interesting than the tired stories that simply show changes in polls after a debate.

First, the New York Times reports that during the presidential debate (between 9:00 and 10:30 PM EDT) there was a high correlation between the Betfair prediction market for who will win the presidential election and after-hours S&P 500 futures prices (see chart 1).

*Chart 1. Betfair prediction market for Mrs. Clinton compared to S&P 500 futures. Courtesy of New York Times*

Correlation between markets is not a new phenomenon. For several decades financial analysts have measured the covariance between commodity prices, especially crude oil, and equity indices. But this is the first time I have seen an article illustrating the covariance between a “fun” market for guessing who will become president and a “real” market. Check out the two graphs above: the similarity in shape is striking, including the fact that both continue to rise for about an hour after the debate ended.

In real time, while the debate was being broadcast, players on Betfair believed the chance Mrs. Clinton will win the election rose by 5%. Meanwhile, the price of S&P 500 futures rose by 0.6%, meaning investors (who may be the same speculators who play on Betfair) believed stock market prices in November were likely to be higher than they had expected before the debate started. There was no other surprise economic news that evening, so the debate is the most likely explanation for the surge. Pretty cool.

If the two markets are perfectly correlated (they aren’t) and markets are perfectly efficient (they aren’t), then one can estimate the difference in equity futures market value between the two candidates. If a 5% decrease in the likelihood of a Trump win translates to a 0.6% increase in equity futures values, then the difference between Mr. Trump and Mrs. Clinton being elected (a 100% change in probability) results in a change in market value of about 12%, or $1.2 trillion (the total market cap of the S&P 500 is about $10 trillion). (Note that I assume perfect correlation between the S&P 500 futures market and the actual market for the stocks used to calculate the index.)

Further, nearly all capital assets (stocks, bonds, commodities, real estate) in the US are now highly correlated. So the total difference is about $24 trillion (assuming total assets in the US are $200 trillion). Ironically, this probably means Donald Trump would be financially better off if he were to lose the election.
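The back-of-envelope arithmetic in the two paragraphs above can be sketched in a few lines of Python. This is only an illustration of the linear extrapolation, using the figures quoted in this post. The function name `implied_value_swing` is my own invention, and the calculation inherits the same unrealistic assumptions (perfect correlation, perfectly efficient markets).

```python
# Back-of-envelope extrapolation from the debate-night market moves.
# Inputs are the figures quoted in the post; the function name is hypothetical.

def implied_value_swing(prob_change, futures_change, total_value):
    """Linearly scale a small observed move to a 100% change in win probability."""
    pct_swing = futures_change / prob_change  # fraction of value per full swing
    return pct_swing, pct_swing * total_value

# A 5% move in win probability coincided with a 0.6% move in futures prices.
pct, sp500_swing = implied_value_swing(0.05, 0.006, 10e12)   # S&P 500 ~ $10 trillion
_, all_assets_swing = implied_value_swing(0.05, 0.006, 200e12)  # US assets ~ $200 trillion

print(f"Implied swing: {pct:.0%}")
print(f"S&P 500: ${sp500_swing / 1e12:.1f} trillion")
print(f"All US assets: ${all_assets_swing / 1e12:.0f} trillion")
```

Changing any of the three inputs shows how sensitive the headline number is to the assumed market size and to the small moves observed on debate night.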

******

The other article that caught my eye involves Google Trends data. According to the Washington Post, the phrase “registrarse para votar” was the third highest trending search term the day after the debate was broadcast. The number of searches is about four times higher than in the days prior to the debate (see chart 2). Notice the spike in searches matches a spike in Sep 2012 after the first Obama-Romney debate.

The article says that it is not clear if it was the debate itself that caused the increase or the fact that Google recently introduced Spanish-language voting guides to its automated Knowledge Box, which presumably led to more searches for “registrarse para votar”. (This is the problem with confounding events.)

After a bit of research, I discovered an even more interesting fact. The spike in searches did not stop on Sep 27. Today, on Sep 30, four days after the debates, the volume of searches is 10 times higher than on Sep 27, or a total of 40x higher than before the debate (see chart 3). The two charts are scaled to make the data comparable.

*Chart 2. Searches for “registrarse para votar” past 5 years to Sep 27. Courtesy of Washington Post and Google Trends*

*Chart 3. Searches for “registrarse para votar” past 5 years to Sep 30. Courtesy of Google Trends*

I wanted to see if the spike was due to the debate or due to the addition of Spanish voter information to the Knowledge Box. To do this, I compared “registrarse para votar” to “register to vote”. The red line in chart 4 shows Google Trends data for “register to vote” scaled so that the bump in Sep 2012 is the same height as in the charts above. I’d say the debate really had an unprecedented effect on interest in voting and the effect was probably bigger for Spanish-speaking web users.

*Chart 4. Searches for “register to vote” past 5 years to Sep 30. Courtesy of Google Trends*
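Since Google Trends reports relative interest rather than raw search counts, comparing two separately charted terms means rescaling one series so that a shared reference point lines up. Here is a minimal sketch of that rescaling in Python. The numbers are made-up placeholders, not the actual Trends data, and the function name `rescale_to_reference` is hypothetical.

```python
# Rescale one search-interest series so its reference bump matches another's.
# The values below are made-up placeholders, not real Google Trends data.

def rescale_to_reference(series, ref_index, target_height):
    """Multiply every point so that series[ref_index] equals target_height."""
    factor = target_height / series[ref_index]
    return [value * factor for value in series]

# Hypothetical interest values; index 1 stands in for the Sep 2012 bump.
spanish = [2, 10, 3, 40, 100]
english = [5, 50, 8, 90, 100]

# Scale the English series so its Sep 2012 bump matches the Spanish one.
english_scaled = rescale_to_reference(english, ref_index=1, target_height=spanish[1])
print(english_scaled)
```

Once the reference bumps match, the relative size of the latest spike in each series can be compared directly, which is the comparison chart 4 makes by eye.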

Finally, I wanted to see how the search requests were distributed geographically. The key here is that most Hispanic communities vote Democratic and many states with a large Hispanic population are already blue (such as California, Washington, New Mexico, New Jersey, and New York). The exception is Florida with a large population of Cuban immigrants who tend to vote Republican.

*Chart 5. Searches for “registrarse para votar” past 5 years to Sep 30 by county. Courtesy of Google Trends*

If you are a supporter of Democrats like Mrs. Clinton, the good news is that a large number of queries are coming from Arizona and Texas, two states where changes in demographics are slowly turning voting preferences from red to blue.

In Florida, it is not clear which candidate gains from an increased number of Spanish-speaking voters. However, since the increase is a result of the debate (during which it was revealed that Mr. Trump had insulted and berated a beauty pageant winner from Venezuela, calling her “Miss Housekeeping”), I will speculate that many newly registered voters are going to be Clinton supporters.

If the Google search trend continues, it may be driven by new reports that Mr. Trump may have violated the US sanctions forbidding business transactions in Cuba. Cuban-Americans searching for information on voter registration after hearing this story are more likely to favor Mrs. Clinton.
