NepalEarthquake3

Town of Barpak after Gorkha earthquake. Image from The Telegraph (UK)

 

by George Taniwaki

This is the final set of my notes from a machine learning class offered by edX. Part 1 of this blog entry was posted in June 2018.

Step 7: Optimize model

At the end of step 6, I discovered that none of my three models met the minimum F1 score (at least 0.60 on the test data) needed to pass the class. Starting with the configuration shown in Figure 5, I modified my experiment by replacing the static data split with a Partition and Sample module using 10 evenly split folds. I used a random seed of 123 to ensure reproducibility.

I added both a cross-validation step and a hyperparameter tuning step to optimize results. To improve performance, I added a Convert to indicator values module. This converts the categorical variables into dummy binary variables before running the model.

Unfortunately, the MAML ordinal regression module does not support hyperparameter tuning. So I replaced it with the one-vs-all multiclass classifier. The new configuration is shown in Figure 6 below. (Much thanks to my classmate Robert Ritz for sharing his model.)

Figure 6. Layout of MAML Studio experiment with hyperparameter tuning

Fig6MamlConfig2

For an explanation of how hyperparameter tuning works, see the Microsoft documentation and the MSDN blog post.

Model 5 – One-vs-all multiclass model using logistic regression classifier

In the earlier experiments, the two-class logistic regression classifier gave the best results. I will use it again with the one-vs-all multiclass model. The default parameter ranges for the two-class logistic regression classifier are: Optimization tolerance = 1E-4, 1E-7; L1 regularization weight = 0.01, 0.1, 1.0; L2 regularization weight = 0.01, 0.1, 1.0; and memory size for L-BFGS = 5, 20, 50.
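For readers who do not use MAML Studio, here is a minimal scikit-learn sketch of the same idea: expand the categorical variables into indicator columns, wrap logistic regression in a one-vs-all classifier, and pick hyperparameters by 10-fold cross-validated grid search with seed 123. The file name, column handling, and the mapping of MAML's sweep onto scikit-learn's tol and C parameters are my assumptions, not the exact MAML pipeline.

```python
# Rough scikit-learn analogue of the Figure 6 experiment (assumptions noted above)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

df = pd.read_csv("train_values_with_labels.csv")   # hypothetical joined file
y = df["damage_grade"]
X = df.drop(columns=["damage_grade"])
cat_cols = X.select_dtypes(include="object").columns.tolist()

# "Convert to indicator values" analogue: one-hot encode the categorical columns
prep = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough")

model = Pipeline([
    ("prep", prep),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))),
])

# Loose stand-in for the MAML parameter sweep: tolerance and regularization strength
param_grid = {
    "clf__estimator__tol": [1e-4, 1e-7],
    "clf__estimator__C": [0.01, 0.1, 1.0],
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
search = GridSearchCV(model, param_grid, cv=cv, scoring="f1_micro")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```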

Table 12a. Truth table for one-vs-all multiclass model using logistic regression classifier

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 296 156 20 472
Predict 2 621 4633 1651 6905
Predict 3 21 847 1755 2623
TOTAL 938 5636 3426 10000

 

Table 12b. Performance measures for one-vs-all multiclass model using logistic regression classifier

Performance Value
Avg Accuracy 0.76
F1 Score 0.64
F1 Score (test data) Not submitted

 

The result is disappointing. The new model has an F1 score of 0.64, which is lower than the 0.68 training F1 score of the ordinal regression model using the logistic regression classifier (model 2).

Model 6 – Add geo_level_2 to model

Originally, I excluded geo_level_2 from the model even though the Chi-square test was significant because it consumed too many degrees of freedom. I rerun the experiment with this variable included, keeping all other variables and parameters the same.

Table 13a. Truth table for one-vs-all multiclass model using logistic regression classifier and including geo_level_2

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 355 218 27 600
Predict 2 564 4662 1446 6672
Predict 3 19 756 1953 2728
TOTAL 938 5636 3426 10000

 

Table 13b. Performance measures for one-vs-all multiclass model using logistic regression classifier and including geo_level_2

Performance Value
Avg Accuracy 0.80
F1 Score 0.70
F1 Score (test data) Not submitted

 

The resulting F1 score on the training data is 0.70, which is better than any prior experiment and meets our target of 0.70 exactly.

Model 7 – Add height/floor to the model

I will try to improve the model by adding a variable measuring height per floor (height / count_floors_pre_eq). This variable is always positive, skewed toward zero, and has a long tail. To normalize it, I apply the natural log transform and name the variable ln_height_per_floor. Table 14 and Figure 7 show the summary statistics.

Table 14. Descriptive statistics for ln_height_per_floor

Variable name Min Median Max Mean Std dev
ln_height_per_floor -1.79 0.69 2.30 0.76 0.25

 

Figure 7. Histogram of ln_height_per_floor

Fig7HistLn_height_per_floor
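A minimal pandas sketch of the derived feature described above. The columns height and count_floors_pre_eq come from the dataset; the file name is a placeholder.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train_values.csv")    # hypothetical file name
df["ln_height_per_floor"] = np.log(df["height"] / df["count_floors_pre_eq"])
print(df["ln_height_per_floor"].describe())   # compare against Table 14
```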

I run the model again with no other changes.

Table 15a. Truth table for one-vs-all multiclass model using logistic regression classifier, including geo_level_2, height/floor

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 366 227 28 621
Predict 2 557 4640 1436 6633
Predict 3 15 769 1962 2746
TOTAL 938 5636 3426 10000

 

Table 15b. Performance measures for one-vs-all multiclass model using logistic regression classifier, including geo_level_2, height/floor

Performance Value
Avg Accuracy 0.80
F1 Score 0.70
F1 Score (test data) Not submitted

 

The accuracy of predicting damage_grade = 1 or 3 increases, but the accuracy for 2 decreases, resulting in no change in average accuracy or F1 score.

Model 8 – Go back to ordinal regression

The accuracy of the one-vs-all multiclass model was significantly improved by adding geo_level_2. Let’s see what happens if I add this variable to the ordinal regression model, which produced a higher training F1 score than the one-vs-all model.

Table 16a. Truth table for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 80 59 1 140
Predict 2 227 1557 542 2326
Predict 3 3 246 585 834
TOTAL 310 1862 1128 3300

 

Table 16b. Performance measures for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor

Performance Value
Avg Accuracy 0.78
F1 Score 0.67
F1 Score (test data) Not submitted

 

Surprisingly, ordinal regression produces worse results when the geo_level_2 variable is included than without it.

Model 9 – Convert numeric to categorical

I spent a lot of effort adjusting and normalizing my numeric variables. They were mostly integer values with small ranges and did not appear to be correlated with damage_grade. Could the model be improved by treating them as categorical? Let’s find out.

First, I perform a Chi-square test to confirm that all of the variables are significant. Then I rerun the model after converting the values from numbers to strings so that the variables are treated as categorical.
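Here is a hedged sketch of those two steps using pandas and scipy. It assumes a data frame that already contains the derived count variables from step 2; the file name and output formatting are placeholders. Table 17 below summarizes the results.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("train_values_with_labels.csv")   # hypothetical joined file

int_cols = ["count_floors_pre_eq", "height", "age", "area",
            "count_families", "count_superstructure", "count_secondary_use"]

# Chi-square test of each integer variable against damage_grade
for col in int_cols:
    table = pd.crosstab(df[col], df["damage_grade"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{col}: chi-square = {chi2:.0f}, dof = {dof}, p = {p:.2g}")

# Treat the integer variables as categorical by storing their values as strings
df[int_cols] = df[int_cols].astype(str)
```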

Table 17. Chi-square results of numerical values to damage_grade

Variable name Chi-square Deg. of freedom P value
count_floors_pre_eq 495 14 < 2.2e-16*
height 367 37 < 2.2e-16*
age 690 60 < 2.2e-16*
area 738 314 < 2.2e-16*
count_families 76 14 1.3e-10*
count_superstructure 104 14 7.1e-16*
count_secondary_use 79 4 3.6e-16*

*One or more enums have sample sizes too small to use Chi-square approximation

Table 18a. Truth table for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor, and converting numeric to categorical

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 83 62 3 148
Predict 2 224 1544 540 2308
Predict 3 3 256 585 844
TOTAL 310 1862 1128 3300

 

Table 18b. Performance measures for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor, and converting numeric to categorical

Performance Value
Avg Accuracy 0.78
F1 Score 0.67
F1 Score (test data) Not submitted

 

Changing the integer variables to categorical has almost no impact on the F1 score.

Conclusion

Table 19 below summarizes all nine models I built. Six of them achieved an F1 score of 0.60 or higher on the training data, which would probably have been sufficient to pass the class. Two of them had an F1 score of 0.70, which would earn a grade of 95 out of 100.

I was unable to run most of these models on the test dataset and submit the results to the data science capstone website. Thus, I do not know what my leaderboard F1 score would be. It is possible that I overfit my model to the training data and my leaderboard F1 score might be lower.

Finding the best combination of variables, models, and model hyperparameters is difficult to do manually. It took me several hours to build the nine models described in this blog post. Machine learning automation tools exist but are not yet robust, nor built into platforms like MAML Studio. (Much thanks again to Robert Ritz who pointed me to TPOT, a Python-based tool for auto ML.)

Table 19. Summary of models. Green indicates differences from base case, model 2

Model Variables Algorithm Training data F1 score (test data)
1 None Naïve guess = 2 None 0.56
3 27 from Table 5 Ordinal regression with decision forest 0.67 split 0.64
4 27 from Table 5 Ordinal regression with SVM 0.67 split 0.57 (0.5644)
2 27 from Table 5 Ordinal regression with logistic regression 0.67 split 0.68 (0.5687)
5 27 from Table 5 One-vs-all multiclass with logistic regression, hyperparameter tuning 10-fold partition 0.64
6 27 from Table 5, geo_level_2 One-vs-all multiclass with logistic regression, hyperparameter tuning 10-fold partition 0.70
7 27 from Table 5, geo_level_2, height/floor One-vs-all multiclass with logistic regression, hyperparameter tuning 10-fold partition 0.70
8 27 from Table 5, geo_level_2, height/floor Ordinal regression with logistic regression 0.67 split 0.67
9 27 from Table 5, convert numeric to categorical, geo_level_2, height/floor Ordinal regression with logistic regression 0.67 split 0.67

NepalEarthquake2

Damage caused by Gorkha earthquake. Image by Prakash Mathema/AFP/Getty Images

by George Taniwaki

This is a continuation of my notes for a machine learning class offered by edX. Part 1 of this blog entry was posted in June 2018.

Step 4: Multivariate analysis

Pairwise scatterplots of the numerical variables after adjusting and normalizing are shown in Figure 4 below. The dependent variable (damage_grade) does not appear to be correlated with any of the numerical independent variables. Despite the lack of correlation, I included all the numeric variables when building the model. If I have time, I will convert these numerical variables into categorical ones.

Among the independent variables, the correlation is highest between count_floors_pre_eq and height (highlighted in green). This makes sense; taller buildings are likely to have more floors. If I have time, I will add a new variable height_per_floor (= height / count_floors_pre_eq).

Figure 4. Pairwise scatterplots of all numerical parameters. Correlations highlighted in green

Fig4Scatterplot

There are 11 categorical and 23 binary parameters. I used the Chi-square test to compare distributions of these to the distribution of the dependent variable damage_grade, treated as categorical. The results are shown in Table 5 below.

All are statistically significant at the 0.05 level, except for the 5 highlighted in red brackets. They will be excluded from the model. Two of the geo_level variables consume too many degrees of freedom given our sample size. (Even big datasets have limitations.) They are highlighted in orange and will be excluded from the model. If I have time, I will add a new custom geo_level variable with about 1000 degrees of freedom. The remaining 27 variables will be retained for use in the model.

Table 5. Chi-square results of categorical and Boolean values to damage_grade

Variable name Chi-square Deg. of freedom P value
geo_level_1_id 2746 60 < 2.2e-16*
geo_level_2_id 6592 2272 < 2.2e-16*
geo_level_3_id 13039 10342 < 2.2e-16*
land_surface_condition 15.9 4 0.0032
foundation_type 1857 8 < 2.2e-16*
roof_type 1122 4 < 2.2e-16
ground_floor_type 1347 8 < 2.2e-16*
other_floor_type 1117 6 < 2.2e-16
position 47.3 6 1.6e-08
plan_configuration 46.5 16 8.0e-05*
legal_ownership_status 80.2 6 3.2e-15
has_superstructure_adobe_mud 50.3 2 1.1e-11
has_superstructure_mud_mortar_stone 1053 2 < 2.2e-16
has_superstructure_stone_flag 28.6 2 6.2e-07
has_superstructure_cement_mortar_stone 53.8 2 2.1e-12
has_superstructure_mud_mortar_brick 37.5 2 7.3e-09
has_superstructure_cement_mortar_brick 632 2 < 2.2e-16
has_superstructure_timber 64.9 2 8.0e-15
has_superstructure_bamboo 55.1 2 1.1e-12
has_superstructure_rc_non_engineered 296 2 < 2.2e-16
has_superstructure_rc_engineered 603 2 < 2.2e-16*
has_superstructure_other 9.8 2 0.0072
has_secondary_use 76.2 2 < 2.2e-16
has_secondary_use_agriculture 25.0 2 3.8e-06
has_secondary_use_hotel 90.9 2 < 2.2e-16
has_secondary_use_rental 49.4 2 1.9e-11
has_secondary_use_institution 10.8 2 0.0046*
has_secondary_use_school 32.1 2 1.1e-07*
has_secondary_use_industry 2.3 2 [ 0.31 ]*
has_secondary_use_health_post 1.5 2 [ 0.46 ]*
has_secondary_use_gov_office 4.2 2 [ 0.12 ]*
has_secondary_use_use_police 0.8 2 [ 0.68 ]*
has_secondary_use_other 15.3 2 0.00049*
has_missing_age 3.1 2 [ 0.21 ]*

*One or more enums have sample sizes too small to use Chi-square approximation
[ ] P value greater than 0.05 significance level

Step 5: Building the model

The dependent variable can take on the value 1, 2, or 3. I could use a classification method like multi-class logistic regression to create our model. However, there is a better way. I will use the ordinal regression algorithm available from Microsoft Azure Machine Learning (MAML).

Ordinal regression requires a binary classifier method. For this project, I will try three classifiers available in MAML: two-class logistic regression, two-class decision forest, and two-class support vector machine (SVM). I will use the default parameters for each classifier and submit all three models as entries in the contest. (This is sort of a cheat to improve the F1 score and my grade. In practice, you should only submit the results using the best model based on the training data.)
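To make the idea concrete, below is a minimal sketch of one common way to build an ordinal regressor out of a binary classifier (the Frank and Hall threshold decomposition): train one classifier per threshold to estimate P(damage_grade > k), then difference the cumulative probabilities. I have not verified that the MAML module uses exactly this construction; treat it as an illustration of the general technique.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

CLASSES = (1, 2, 3)

def fit_ordinal(X, y, base=LogisticRegression(max_iter=1000)):
    """Fit one binary model per threshold damage_grade > k, for k = 1, 2."""
    return {k: clone(base).fit(X, (y > k).astype(int)) for k in CLASSES[:-1]}

def predict_ordinal(models, X):
    # Cumulative probabilities P(y > k) from each threshold model
    gt = {k: m.predict_proba(X)[:, 1] for k, m in models.items()}
    probs = []
    for i, c in enumerate(CLASSES):
        upper = 1.0 if i == 0 else gt[CLASSES[i - 1]]   # P(y > previous class)
        lower = gt[c] if c in gt else 0.0               # P(y > this class)
        probs.append(upper - lower)                     # P(y = c)
    probs = np.clip(np.column_stack(probs), 0, None)
    return np.array(CLASSES)[probs.argmax(axis=1)]
```

The base argument is whatever two-class learner you want to decompose; it defaults to logistic regression here to match model 2, and swapping in a decision forest or SVM only changes that argument.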

A simple model configuration using a static data split and without either a cross-validation step or a hyperparameter tuning step to optimize results is shown in Figure 5 below. I will add these steps later if the simple model does not meet the performance goals.

Figure 5. Layout of a simple MAML Studio experiment

Fig5MamlConfig1

I used a data split of 0.67, meaning 6,700 records were used to train the model and the remaining 3,300 were used to score and evaluate it. I used a random number seed of 123 to ensure every run of my experiment used the same split and produced replicable results.
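For reference, the equivalent split outside MAML is a one-liner; X and y here stand for the feature matrix and the damage_grade labels from the joined training file.

```python
from sklearn.model_selection import train_test_split

# 67% to train, 33% to score, fixed seed for replicable results
X_train, X_score, y_train, y_score = train_test_split(
    X, y, train_size=0.67, random_state=123)
```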

Step 6: Measuring performance and testing results

A generic truth table for an experiment with 3 outcomes is shown in Table 6a below. Using data from the truth table, four performance metrics can be calculated: average accuracy, micro-average precision, micro-average recall, and micro-average F1 score. The calculation of the performance metrics is shown in Table 6b. Note that because every record is assigned to exactly one class, each false positive for one class is a false negative for another, so the micro-average precision, recall, and F1 score are all equal. Perhaps the contest should have used the macro-average recall and precision to calculate the F1 score.

Table 6a. Generic truth table for case with 3 outcomes

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 TP1|TN2|TN3 FP1|FN2|TN3 FP1|TN2|FN3 TP1+FP1
Predict 2 FN1|FP2|TN3 TN1|TP2|TN3 TN1|FP2|FN3 TP2+FP2
Predict 3 FN1|TN2|FP3 TN1|FN2|FP3 TN1|TN2|TP3 TP3+FP3
TOTAL TP1+FN1 TP2+FN2 TP3+FN3 Pop

 

Table 6b. Performance measures for case with 3 outcomes

Performance Calculation
Avg Accuracy ∑((TP + TN) / Pop) / 3
Avg Precision (micro) P = ∑TP / ∑(TP + FP) = ∑TP / Pop
Avg Recall (micro) R = ∑TP / ∑(TP + FN) = ∑TP / Pop
F1 Score (2 * P * R) / (P + R) = ∑TP / Pop
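A quick numerical check of the identities in Table 6b, using the confusion matrix from model 2 (Table 9a below). The only assumption is the row/column orientation, which follows the truth tables in this post.

```python
import numpy as np

# rows = predicted class, columns = actual class, as in the truth tables below
cm = np.array([[  84,   52,   1],
               [ 223, 1567, 536],
               [   3,  243, 591]])

tp = np.diag(cm)
fp = cm.sum(axis=1) - tp        # predicted as class k but actually another class
fn = cm.sum(axis=0) - tp        # actually class k but predicted as another class
tn = cm.sum() - cm.sum(axis=1) - cm.sum(axis=0) + tp
pop = cm.sum()

precision = tp.sum() / (tp.sum() + fp.sum())
recall = tp.sum() / (tp.sum() + fn.sum())
f1 = 2 * precision * recall / (precision + recall)
avg_accuracy = ((tp + tn) / pop).mean()

print(precision, recall, f1, tp.sum() / pop)   # all equal, about 0.68
print(avg_accuracy)                            # about 0.79, matching Table 9b
```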

 

Grading for the course is based on the F1 score for a hidden subset of the test dataset as shown in Table 7 below. F1 scores between these points will receive linearly proportional grades. For instance,  an F1 score of 0.65 would earn a grade of 75.

Table 7. Grading of project based on F1 score (test data)

F1 Score Grade
< 0.60 1 out of 100
0.60 60/100
0.64 70/100
0.66 80/100
0.70 95/100
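The grade scale is just piecewise-linear interpolation between the breakpoints in Table 7, which is easy to check with numpy (scores below 0.60 are a special case that earns a grade of 1):

```python
import numpy as np

breakpoints = [0.60, 0.64, 0.66, 0.70]
grades = [60, 70, 80, 95]

print(np.interp(0.65, breakpoints, grades))   # 75.0, matching the example above
```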

 

Below are the results of my models.

Model 1 – Naïve guessing

The most common value for damage_grade is 2. Any prediction model should perform better than the naïve guess of predicting the damage is 2 for all buildings.

Table 8a. Truth table for naive guess of damage_grade = 2

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 0 0 0 0
Predict 2 310 1862 1128 3300
Predict 3 0 0 0 0
TOTAL 310 1862 1128 3300

 

Table 8b. Performance measures for naive guess of damage_grade = 2

Performance Value
Avg Accuracy 0.71
F1 Score 0.56
F1 Score (test data) Not submitted

 

Model 2 – Ordinal regression model using 2-class logistic regression classifier

The default parameters for the 2-class logistic regression classifier are: Optimization tolerance = 1E-07, L1 regularization weight = 1, L2 regularization weight = 1. Notice in Table 9b the large gap between the F1 score using the training data and the test data. This indicates the model is overfitted.

Table 9a. Truth table for ordinal regression model using 2-class logistic regression classifier

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 84 52 1 137
Predict 2 223 1567 536 2326
Predict 3 3 243 591 837
TOTAL 310 1862 1128 3300

 

Table 9b. Performance measures for ordinal regression model using 2-class logistic regression classifier

Performance Value
Avg Accuracy 0.79
F1 Score 0.68
F1 Score (test data) 0.5687

 

Model 3 – Ordinal regression model using 2-class decision forest classifier

The default parameter settings are: Resampling method = Bagging, Trainer mode = Single parameter, Number of decision trees = 8, Maximum depth of trees = 32, Number of random splits per node = 128, Minimum number of samples per node = 1.

Table 10a. Truth table for ordinal regression model using decision forest classifier

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 85 55 4 144
Predict 2 218 1335 432 1985
Predict 3 7 472 692 1171
TOTAL 310 1862 1128 3300

 

Table 10b. Performance measures for ordinal regression model using decision forest classifier

Performance Value
Avg Accuracy 0.76
F1 Score 0.64
F1 Score (test data) Not submitted

 

Model 4 – Ordinal regression model using 2-class support vector machine (SVM) classifier

The default parameter settings are: Number of iterations = 1, Lambda = 0.001.

Table 11a. Truth table for ordinal regression model using SVM classifier

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 31 21 3 55
Predict 2 273 1667 932 2872
Predict 3 6 174 193 373
TOTAL 310 1862 1128 3300

 

Table 11b. Performance measures for ordinal regression model using SVM classifier

Performance Value
Avg Accuracy 0.72
F1 Score 0.57
F1 Score (test data) 0.5644

 

Summary of models in step 6

Unfortunately, none of my initial models performed well. The F1 score never meets the target of 0.70. In fact, in some cases my models don’t do much better than just guessing. (Note: You can see my scores on the contest leaderboard.) In the next section, we will add a cross-validation step and a hyperparameter tuning step to optimize the models and see if that improves them.

This completes steps 4 to 6 of building a machine learning model. The remaining optimization step and the results are posted at How to create a machine learning model – Part 3.

NepalEarthquake1

Damage caused by Gorkha earthquake. Image from The Guardian

by George Taniwaki

At the beginning of each quarter edX and Microsoft offer a one-month long course called DAT 102x, Data Science Capstone. The class consists of a single machine learning project using real-world data. The class this past April used data collected by the Nepal government after the Gorkha earthquake in 2015. The earthquake killed nearly 9,000 people and left over 100,000 homeless.

The assignment was to predict damage level for individual buildings based on building characteristics such as age, height, location, construction materials, and use type. Below are the steps I used to solve this problem. The solution is general enough to apply to any machine learning problem. My description is a bit lengthy but shows the iterative nature of tuning a machine learning model.

About machine learning contests

The class project is operated as a contest. Students download training and test datasets, create a model using the training dataset, use the model to make predictions for the test dataset, submit their predictions, and see their scored results on a leaderboard.

As is common for machine learning contests, the training data consists of two files. The first file includes a large number of records (for the earthquake project, there were 10,000). Each record consists of an index and the independent variables (38 building parameters). A separate label file contains only two columns: the index and the associated dependent variable(s) (in the earthquake project, there is only one, called damage_grade). The two files must be joined before creating a model.

The test file (10,000 more records of building indexes and parameters) has the same format as the training file. You use your model to create an output file consisting of the index and the predicted values of the dependent variable(s). You submit the file to a web service which then scores your results.
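In pandas, the join and the submission file look roughly like this. The file names and the name of the index column are my guesses, not the contest's actual names.

```python
import pandas as pd

values = pd.read_csv("train_values.csv", index_col="building_id")   # assumed index name
labels = pd.read_csv("train_labels.csv", index_col="building_id")
train = values.join(labels)          # one row per building with damage_grade attached

test = pd.read_csv("test_values.csv", index_col="building_id")
# ...fit a model on train, then write the predictions for scoring:
# submission = pd.DataFrame({"damage_grade": model.predict(test)}, index=test.index)
# submission.to_csv("submission.csv")
```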

You can submit multiple predictions to be scored, adjusting your model after each submission in an attempt to improve your score. Your final score is based on the highest score achieved. To reduce the chance that competitors (students) overfit their model to the test data, the score is based on an undisclosed subset of records in the test file.

Approach to model building

The general approach to building a machine learning model is to first examine the dependent variables using univariate methods (step 1). Repeat for the independent variables (step 2). Normalize the variables (step 3). Examine correlations using multivariate methods (step 4).  Select the relevant variables, choose a model, and build it (step 5). Evaluate and test the model (step 6) and tune the parameters (step 7) to get the best fit without overfitting. Some iteration may be required.

Step 1: Univariate statistics for the dependent variable

There is one dependent variable, damage_grade, labeled with an integer from 1 to 3. Higher values mean worse damage. However, the intervals between each class are not constant, so the scale is ordinal not interval. Descriptive statistics are shown in Table 1 and Figure 1 below.

Table 1. Descriptive statistics for dependent variable

Variable name Min Median Max Mean Std dev
damage_grade 1 2 3 2.25 0.61

 

Figure 1. Histogram of dependent variable

Fig1HistDep

Step 2: Univariate statistics for the independent variables

As mentioned above, there are 38 building parameters. Details of the variables are given at Data Science Capstone. The 38 independent variables can be divided into 4 classes: binary, interval integer, float, and categorical, as shown in Table 2 below. Notice that the parameter count_families is defined as a float even though it only takes on integer values.

Table 2. Summary of independent variables

Variable type Quantity Examples
Binary (Boolean) 22 has_superstructure_XXXX, has_secondary_use, has_secondary_use_XXXX
Integer, interval 4 count_floors_pre_eq, height, age, area
Float 1 count_families
Categorical 11 geo_level_1_id, geo_level_2_id, geo_level_3_id, land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type, position, plan_configuration, legal_ownership_status

 

The binary variables fall into three groups. The first is has_superstructure_XXXX, where XXXX is one of 11 possible materials used in the building superstructure, such as adobe mud, timber, bamboo, etc. The second is has_secondary_use_XXXX, where XXXX is one of 10 possible secondary uses for the building, such as agriculture, hotel, school, etc. Finally, has_secondary_use indicates whether any of the has_secondary_use_XXXX variables is true.

Whenever I have groups of binary variables, I like to create new interval integer variables based on them. In this case, they are named count_superstructure and count_secondary_use which are a count of the number of true values for each. count_superstructure can vary from 0 to 11 while count_secondary_use can vary from 0 to 10.
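A minimal pandas sketch of those two derived variables, assuming the binary columns follow the naming shown in Table 2 (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("train_values.csv")   # hypothetical file name

# The trailing underscore excludes the aggregate has_secondary_use flag itself
superstructure_cols = [c for c in df.columns if c.startswith("has_superstructure_")]
secondary_use_cols = [c for c in df.columns if c.startswith("has_secondary_use_")]

df["count_superstructure"] = df[superstructure_cols].sum(axis=1)
df["count_secondary_use"] = df[secondary_use_cols].sum(axis=1)
```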

For the 7 numerical parameters, the minimum, median, maximum, mean, standard deviation, and histograms are shown in Table 3 and Figure 2 below. Possible outliers, which occur in all 7 numerical variables, are highlighted in red.

Table 3. Descriptive statistics for the 7 numerical variables. Red indicates possible outliers

Variable name Min Median Max Mean Std dev
count_floors_pre_eq 1 2 9 2.15 0.74
height 1 5 30 4.65 1.79
age 0 15 995 25.39 64.48
area 6 34 425 38.44 21.27
count_families 0 1 7 0.98 0.42
count_superstructure 1 1 8 1.45 0.78
count_secondary_use 0 0 2 0.11 0.32

 

 

Figure 2. Histograms for the 7 numerical variables. Red indicates possible outliers

Fig2HistIndep

Upon inspection of the dataset, none of the numerical variables have missing values. However, it appears that for 40 records, age has an outlier value of 995. The next highest value is 200. Further, the lowest age value is zero, which does not work well with the log transform. To clean the data, I created a new binary variable named has_missing_age, with value = 1 if age = 995, and 0 otherwise. I also created a new numerical variable named adjust_age, with value = 1 if age = 0, 15 (the median) if age = 995, and age otherwise.
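Those two cleaning rules translate directly into pandas/numpy; everything here follows the text except the file name:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train_values.csv")   # hypothetical file name

df["has_missing_age"] = (df["age"] == 995).astype(int)
df["adjust_age"] = np.select(
    [df["age"] == 0, df["age"] == 995],
    [1, 15],                           # 1 when age is 0; the median 15 for the 995 outliers
    default=df["age"])
```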

The variable area has a wide range, but does not appear to contain any outliers.

The variables count_families, count_superstructure, and count_secondary_use do not seem to have any outliers. They also do not need to be normalized.

For the 11 categorical variables, the number of categories (enums), and the names of the enums with the largest and smallest counts are shown in Table 4 below.

Table 4. Descriptive statistics for the 11 categorical variables. Red indicates one or more enums has fewer than 10 recorded instances

Variable name Count enums Max label / count Min label / count Comments
geo_level_1_id 31 0 / 903 30 / 7 hierarchical
geo_level_2_id 1137 0 / 157 tie / 1 hierarchical
geo_level_3_id 5172 7 / 24 tie / 1 hierarchical
land_surface_condition 3 d502 / 8311 2f15 / 347
foundation_type 5 337f / 8489 bb5f / 48
roof_type 3 7g76 / 7007 67f9 / 579
ground_floor_type 5 b1b4 / 8118 bb5f / 10
other_floor_type 4 f962 / 6412 67f9 / 415 correlated with ground_floor_type
position 4 3356 / 7792 bcab / 477
plan_configuration 9 a779 / 9603 cb88 / 1 lumpy distribution
legal_ownership_status 4 c8e1 / 9627 bb5f / 61 lumpy distribution

 

None of the categorical or binary variables have missing values. None of the enums are empty, though some of the enums (highlighted in red in Table 4 above) have fewer than 10 recorded instances, so they may bias the model. Unfortunately, we do not have any information about the meaning of the enum labels, so we do not have a good way to group the sparse enums into larger categories. If this were a real-world problem, I would take time to investigate this issue.

Step 3: Normalize the numerical variables

As can be seen in Figure 2 above, some of the independent numerical variables are skewed toward low values and have long tails. I normalized the distributions by creating new variables using the log transform function. The resulting variables have names prefixed with ln_. The descriptive statistics of the normalized variables are shown in Table 5 and Figure 3 below.

Table 5. Descriptive statistics for adjusted and normalized numerical parameters

Variable name Min Median Max Mean Std dev
ln_count_floors_pre_eq 0 0.69 2.19 0.70 0.36
ln_height 0 1.61 3.40 1.46 0.40
ln_adjust_age 0 2.71 5.30 2.61 1.12
ln_area 1.79 3.53 6.05 3.54 0.46

 

Figure 3. Histograms of adjusted and normalized numerical parameters

Fig3HistIndep
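The transforms behind Table 5 and Figure 3 are a plain natural log applied to the adjusted variables. A short sketch, repeating the adjust_age rule from step 2 so it runs on its own (file name assumed):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train_values.csv")                    # hypothetical file name
df["adjust_age"] = df["age"].replace({0: 1, 995: 15})   # cleaning rule from step 2

for col in ["count_floors_pre_eq", "height", "adjust_age", "area"]:
    df["ln_" + col] = np.log(df[col])                   # ln_count_floors_pre_eq, ln_height, ...
```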

This completes the first 3 steps of building a machine learning model. Steps 4 to 6 to solve this contest problem are posted at How to create a machine learning model – Part 2. The final optimization step 7 is posted at Part 3.

MsftBigData

by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist? And which software stack should one specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to take an online class developed and sponsored by Microsoft and edX. Completion of the course leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about half way done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a t-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced mooc. This course starts with the basics like select, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and on an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called get and transform data) and M language and Power Pivot and DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes rule, random variables, central limit theorem, sampling and confidence interval, and hypothesis testing. Most analysis is conducted using the Data analysis pack add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed 9 questions on the quizzes (88%) and six in the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, it goes to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure subject mastery by a beginning programmer. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

DAT204x Score     DAT204x Certificate

DAT203.1x Data Science Essentials

The first three modules in this course covered statistics and were mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate it using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six out of the ten courses required for a certificate. I expect to finish the remaining 4 needed for a certificate by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.

by George Taniwaki

NASA recently celebrated the fifteenth anniversary of the launching of the Chandra X-ray Observatory by releasing several new images. One of the images, shown below, is an amazing composite that reveals in exquisite detail the turbulence surrounding the remnants of the Tycho supernova. (Scroll down to Figure 2, then come back.)

Tycho supernova

The scientific name of the Tycho supernova remnant is SN 1572, where SN means supernova and 1572 refers to the year it was first observed. That’s right, over 400 years ago, in November 1572, many people noticed a new bright object in the sky near the constellation Cassiopeia. Reports at the time indicated that it was as bright as Venus (peak magnitude of –4) meaning it was visible during the day.

SN 1572 is called the Tycho supernova because a Danish scientist named Tycho Brahe published a paper detailing his observations. His paper is considered one of the most important in the history of astronomy and of Renaissance science.

Tycho_Cas_SN1572

Figure 1. Star map drawn by Tycho Brahe showing position of SN 1572 (labelled I) within the constellation Cassiopeia. Image from Wikipedia

What people at the time didn’t know was that SN 1572 was about 9,000 light years away, meaning it was unimaginably far away. The explosion that caused it happened long ago but the light had just reached the earth.

(Actually, SN 1572 is fairly close to us relative to the size of the Milky Way, which is 100,000 light years across, and extremely close relative to the size of the observable universe, which is about 93 billion light years across. Space is just really unimaginably large.)

What they also didn’t know was that SN 1572 was probably a Type Ia supernova. This type of supernova is common and has a very specific cause. It starts with a binary star system in which two stars orbit one another very closely. Over time, one of the stars consumes all of its hydrogen and dies out, leaving a carbon-oxygen core. Its gravity causes it to accrete gas from its companion until its mass reaches what is called the Chandrasekhar limit and it collapses. The increased pressure causes carbon fusion to start. This results in a runaway reaction, causing the star to explode.

About a supernova remnant

In the 400 years since SN1572 exploded, the debris from it has been flying away at 5,000 km/s (3100 mi/s). It is hard to see this debris. Imagine a large explosion on the earth that occurs at night.

The debris itself doesn’t generate very much light, but it does produce some. Space is not a vacuum; it is a very thin gas. When electrons from the moving debris of the supernova remnant strike a stationary particle, a photon is given off (which, depending on the energy of the collision, is seen as radio waves, microwaves, visible light, UV, or x-rays). This energy also heats up the remaining particles, releasing additional photons, making them detectable with a very sensitive telescope.

About false color images

The Chandra X-ray Observatory was launched in 1999 from the space shuttle Columbia. As the name implies, it can take digital images of objects in the x-ray range of light. Since humans cannot see in this range, images taken in the x-ray range are often color coded in the range from green to blue to purple.

Often, composite images of space objects are created using telescopes designed to capture photons from different wavelengths. For instance, visible light telescopes like the Hubble Space Telescope often have the colors in their images compressed to black and white. Images from infrared telescopes, like the Spitzer Space Telescope, and ground-based radio telescopes are often given a false color range between red to orange.

Pictures please

All right, finally the results. Below is the most detailed image ever of the Tycho supernova remnant. It is a composite created by layering multiple, long-exposure, high-resolution images from the Chandra X-ray Observatory. The press release says, “The outer shock has produced a rapidly moving shell of extremely high-energy electrons (blue), and the reverse shock has heated the expanding debris to millions of degrees (red and green).

“This composite image of the Tycho supernova remnant combines X-ray and infrared observations obtained with NASA’s Chandra X-ray Observatory and Spitzer Space Telescope, respectively, and the Calar Alto observatory, Spain.

“The explosion has left a blazing hot cloud of expanding debris (green and yellow) visible in X-rays. The location of ultra-energetic electrons in the blast’s outer shock wave can also be seen in X-rays (the circular blue line). Newly synthesized dust in the ejected material and heated pre-existing dust from the area around the supernova radiate at infrared wavelengths of 24 microns (red).”

Tycho2014

Figure 2. Tycho supernova remnant composite image released in 2014. Image from NASA

Compare Figure 2 above to an image of the Tycho supernova remnant that NASA released in 2009 using data from observations made in 2003 shown below. Notice the lack of details. Also notice the large number of stars in the background, some even shining through the dust of the explosion. Apparently, the image above has been modified to eliminate most of these distractions.

These two images, dated only a few years apart, reveal what are likely remarkable advances in software for manipulating space images. I say that because the hardware in the telescopes themselves, such as optics, detectors, and transmitters, probably has not changed much since launch. Thus, any improvements in resolution and contrast between the two images are a result of better capabilities of the software used to process images after the raw data is collected.

A New View of Tycho's Supernova Remnant

Figure 3. Tycho supernova remnant composite image released in 2009. Image from NASA

by George Taniwaki

Your smartphone is more than an addictive toy. With simple modifications, it can become a lifesaving medical device. The phone can already receive and send data to medical sensors and controllers wirelessly. By adding the right software, a smartphone can do a better job than a more expensive standalone hospital-grade machine.

In addition, smartphones are portable and patients can be trained to use them outside a clinical setting. The spread of smartphones has the potential to revolutionize the treatment of chronic conditions like diabetes. This can enhance the quality of life of the patient and significantly increase survival.

Monitoring blood sugar

Type 1 diabetes mellitus is an autoimmune disease in which the body attacks the pancreas and interrupts the production of insulin. Insulin is a hormone that causes the cells in the body to absorb glucose (a type of sugar) from the blood and metabolize it. Blood sugar must be controlled to a very tight range to stay healthy.

A lack of insulin after meals can lead to persistent and repeated episodes of high blood sugar, called hyperglycemia. This in turn can lead to complications such as damage to nerves, blood vessels, and organs, including the kidneys. Too much insulin can deplete glucose from the blood, a situation called hypoglycemia. This can cause dizziness, seizures, unconsciousness, cardiac arrhythmias, and even brain damage or death.

Back when I was growing up (the 1970s), patients with type 1 diabetes had to prick their finger several times a day to get a blood sample and determine if their glucose level was too low or too high. If it was too low, they had to eat a snack or meal. (But not one containing sugar.)

They would also test themselves about an hour after each meal. Often, their glucose level was too high, and they had to calculate the correct dose of insulin to self-inject into their abdomen, arm, or leg to reduce it. If they were noncompliant (forgetful, busy, unable to afford the medication, fearful or distrustful of medical institutions or personnel, etc.), they would eventually undergo diabetic ketoacidosis, which often would require a hospital stay to treat.

BloodGlucoseTestStrip

Figure 1a. Example of blood glucose test strip. Photo from Mistry Medical

InsulinShot

Figure 1b. Boy demonstrating how to inject insulin in his leg. Photo from Science Photo Library

If all these needle pricks and shots sound painful and tedious, they were and still are. There are better test devices available now and better insulin injectors, but they still rely on a patient being diligent and awake.

Yes, being awake is a problem. It is not realistic to ask a patient to wake up several times a night to monitor her glucose level and inject herself with insulin. So most patients give themselves an injection just before going to bed and hope they don’t give themselves too much and that it will last all night.

Continuous glucose monitoring

Taking a blood sample seven or eight times a day is a hassle. But even then, it doesn’t provide information about how quickly or how much a patient’s glucose level changes after a meal, after exercise, or while sleeping.

More frequent measurements would be needed to estimate the rate at which a patient’s glucose level would likely rise or fall after a meal, exercise, or sleeping. Knowing the rate would allow the patient to inject insulin before the glucose level was outside the safe range or reduce the background dosage if it is too high.

In the 1980s, the first continuous glucose meters were developed to help estimate the correct background dosage of insulin and the correct additional amounts to inject after snacks and meals.

The early devices  were bulky and hard to use. They consisted of a sensor that was inserted under the skin (usually in the abdomen) during a doctor visit and had wires that connected it to a monitoring device that the patient carried around her waist. The sensor reported the glucose level every five to ten seconds and the monitor had enough memory to store the average reading every five to ten minutes over the course of a week.

The devices were not very accurate and had to be calibrated using the blood prick method several times a day. The patient would also have to keep a paper diary of the times of meals, medication, snacks, exercise, and sleep. After a week, the patient would return to the doctor to have the sensor removed.

The doctor would then have to interpret the results and calculate an estimated required background dose of insulin during the day and during the night and the correct amount of additional injections after snacks and meals. The patient would repeat the process every year or so to ensure the insulin dosages were keeping the glucose levels within the desired range.

Today, continuous glucose monitors can measure glucose levels using a disposable sensor patch on the skin that will stay in place for a week. It transmits data to the monitor wirelessly. Using a keypad, the monitor can also record eating, medication, exercise, and sleeping. The monitor can store months of personal data and calculate the amount of insulin needed in real-time. Alerts remind the patient when to inject insulin and how much. They are cheap enough and portable enough that the patient never stops wearing it.

ContinuousGlucoseMonitor

Figure 2. Wireless continuous blood glucose monitor and display device. Image from Diabetes Healthy Solutions

Continuous insulin pump

Also in the 1980s, the first generation of subcutaneous insulin pumps were commercialized. These pumps could supply a low background dose of insulin rather than big spikes provided by manual injections. The first pumps were expensive, bulky, hard to use. By the early 2000s though, insulin pumps became widely available and were shown to reliably reduce the fluctuations in glucose levels seen in patients who relied on manual injections. By providing a low dose of insulin continuously during the day and at night with the ability of the patient to manually apply larger doses after meals, it lowered the average level of glucose while also reducing the incidence of hypoglycemia. Over longer periods it also reduced the incidence of complications commonly seen with diabetes.

InsulinPumpEarlyinsulinpump

Figure 3a and 3b. Early insulin pump (left) and modern version (right). Images from Medtronic

There is one drawback to the continuous insulin pump. It can provide too much insulin at night while the patient is asleep. While sleeping, the patient’s glucose level falls. Since she is not performing blood tests, she will not notice that the insulin pump is set too high. Further, since she is asleep she may not realize that she is in danger, a condition called nocturnal hypoglycemia.

Software to control the pump

Imagine combining the continuous glucose meter with the continuous insulin pump. Now you have a system that mimics the behavior of the human pancreas. Sensors constantly monitor the patient’s glucose level, and anticipate changes caused by activities like eating, sleeping, and exercise.

The key is to use a well-written algorithm to predict the amount of insulin needed to be injected by the pump to keep sugar levels within the acceptable range. Instead of a human, software controls the insulin pump. If the glucose level does not stay within the desired levels, the algorithm learns its mistake and corrects it.

The initial goal of the combined monitor and pump was to predict low glucose levels while a patient was sleeping and suspend the pumping of insulin to prevent nocturnal hypoglycemia. Ironically, the US FDA panel rejected the first application submitted for the device saying that the traditional uncontrolled continuous insulin pump was actually safer than a new device because of the new device’s lack of field experience.

After years of additional studies, the combined device, manufactured by Medtronic, was approved for use in the US in 2013. Results of a study involving 25 patients in the UK were published in Lancet Jun 2014. Another trial, involving 95 patients in Australia, was published in J. Amer. Med. Assoc. Sept 2013.

CombinedGlucoseMonitorInsulinPump

Figure 4. Combined glucose meter and insulin pump form a bionic pancreas. Image from Medtronic

Better software and smartphones

The Medtronic combined device is proprietary. But several groups are hacking it to make improvements. For instance, researchers led by Z. Mahmoudi and M. Jensen at Aalborg University in Denmark have published several papers (Diabetes Techn Ther Jun 2014, Diabetes Sci Techn Apr 2014, Diabetes Techn Ther Oct 2013) on control algorithms that may be superior to the one currently used in the Medtronic device.

Another interesting paper appeared in the New Engl J Med Jun 2014. It reports a study by Dr. Steven Russell of Massachusetts General Hospital and his colleagues. They wrote an app for a smartphone (Apple’s iPhone 4S) that could receive the wireless data from the Medtronic glucose meter and wirelessly control the Medtronic insulin pump.

Smartphones are ideal platforms for use in developing medical devices because they can communicate wirelessly with other devices, have sufficient computing power and memory for even the most complex control tasks, are designed to be easy to program and easy to use, and many people already own one.

Dr. Russell and his colleagues used a machine learning algorithm they had previously developed (J Clin Endocrinol Metab May 2014) to couple the two.

Even though this is a research project, not a commercial product, the results are pretty impressive. The study lasted 5 days, with the first day used to calibrate the algorithm and days 2-5 as the test.

As can be seen in Figure 5, after a day of “training” patients using the bionic pancreas (solid black line) had lower average glucose levels than patients on the standard protocol (solid red line). Further, the variance of their glucose level (black shaded area) was smaller than for patients on the standard protocol (red shaded area). Notice how much better the control is using the bionic pancreas, especially at night.

InsulinNEJM1

Figure 5. Variation in mean glucose level among adults during 5-day study. Image from New Engl J Med

Another measure of quality is the amount of time the patients’ glucose levels were within the desired level of 70 to 120 mg/dl (the green shaded region in Figure 6). Patients with the bionic pancreas (solid black line) spent about 55% of the time within the desired level. They also had fewer incidents of hypoglycemia (pink shaded region) or hyperglycemia (white region on right) than patients using the standard protocol (red line).

Note that even with the bionic pancreas, 15% of the time patients had a glucose level above 180, so there is still plenty of room to improve control.

InsulinNEJM

Figure 6. Cumulative glucose level in adults during day 1 where the bionic pancreas adapted to the patient (dashed line) and days 2-5 (solid black). Image from New Engl J Med

by George Taniwaki

If you have ever taken a picture through an exterior window, you may have been disappointed with the results. This can happen either standing inside a building shooting out, or standing on the street shooting in. That’s because if the side you are standing on is brighter than the other side, you will get reflections of the lights off the glass. And if you are standing in the light, you will also get the reflection of yourself, which just looks goofy.

Another problem is that if it is a sunny day, it will be much brighter outside than inside, so all of the objects inside the building will be too dark relative to the objects outside.

Photographer Nick Kelsh has a blog post where he gives a good description of the problem and some tips to avoid it. Though his recommendation to throw a rock through the window may be a bit extreme.

You can try to fix the problem using Photoshop, but it is a lot of work because you will need to fix the color balance for two different parts of the scene. The interior lighting will probably be an artificial source like incandescent (2700K) or fluorescent (5000K) while the exterior will be lit by the sun (~10,000K). Removing the reflections will be harder still.

A better solution, if you can control the interior lights, is to take two pictures and combine them. The first picture is taken with the lights on and is designed to capture the object inside the building. The second picture is taken with the lights off and is designed to capture the view outside the window. (This assumes the lights are inside.)

Then you merge the two images. If the window is rectangular, you don’t even need fancy equipment and software to perform this trick. In the example I show below, I didn’t use a tripod, just a handheld smartphone. To manipulate the images I used  free Google Picasa to adjust the color balance and brightness of the two images. I copied and pasted the image within the window using free Microsoft Paint.

Take two pictures

The first step is to take two pictures out of the window without moving the camera, one with the lights on and one with the lights off. For this example, I am standing inside the house. When taking the picture with the lights on, I set my focus and exposure to capture the most prominent object(s) inside the house. I ignore the reflections in the window (Fig 1).

Conversely, when taking the picture with the lights off, I set my focus and exposure to capture the view though the window. I ignore the underexposure and poor lighting of objects inside the house (Fig 2).

LightsOnLightsOff

Figures 1 and 2. Image with interior lights on (left) and lights off (right)

For best results, you will want to mount the camera on a tripod to eliminate any movement between the two images and any blur caused by long exposure during the lights off photo. However, if the window is rectangular and there is no overlap between objects in the foreground and the window, it doesn’t really matter. In fact, the images don’t even have to be out the same window.

Adjust the brightness and color balance separately

Open each image individually in your favorite photo editor software package. I use Google Picasa because it is free. Set the color balance, brightness, and contrast. I dislike the yellowish tone of photographs taken with incandescent lights and always try to correct them to fluorescent. When taking photos with a low quality camera, such as most smartphones, also apply despeckle and unsharp masking. Save both images.

Copy and paste

You can use Google Picasa to crop an image and copy it to the clipboard. However, it doesn’t appear to have a way to paste the clipboard into an image. Thus, we will merge the two images using free Microsoft Paint. Open the image that has the good view through the window. Create a selection rectangle around the image within the window frame and copy it to the clipboard. Then open the lights-on image that captures the interior objects and paste. If the images don’t overlap properly because they are different sizes, start over and resize the first image to match the second. Save under a new name (so you don’t ruin the original image).
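If you prefer to script the merge instead of using Paint, the same copy-and-paste step is a few lines with the Pillow library. The file names and the pixel coordinates of the window rectangle are placeholders you would measure from your own photos.

```python
from PIL import Image

lights_on = Image.open("lights_on.jpg")     # exposure set for the interior objects
lights_off = Image.open("lights_off.jpg")   # exposure set for the view out the window

window_box = (850, 300, 1700, 1100)         # hypothetical (left, upper, right, lower)
window_view = lights_off.crop(window_box)

merged = lights_on.copy()
merged.paste(window_view, window_box[:2])   # paste at the same upper-left corner
merged.save("merged.jpg")
```

This assumes both photos are the same size and taken from the same position (a tripod helps), just as described above.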

The resulting image (Fig 3) looks pretty good, but it isn’t perfect. First, because I was using a smartphone, the focal length of the lens is very short so the sides of the windows are curved rather than straight. Since I used a rectangular selection to cut and paste, parts of the original window image show through. For instance, you can see the reflection of a light peeking through in the upper right corner of the window.

Second, the gooseneck of the sink faucet is within the rectangular selection so it is too dark.

Finally, since I didn’t use a tripod, there is some movement and the gooseneck of the sink faucet is slightly displaced.

KItchenWindows_13

Figure 3. The merged image, color corrected and cropped

Using Photoshop or other advanced image editor, you can make a better selection. And you don’t have to use the image out the window. Once you have created a mask, you can paste any image into the scene.

All photographs by George Taniwaki