Animated Covid-19 map, screenshot from Domo

by George Taniwaki

In order to make predictions about the future trajectory of the spread of Covid-19, you need to be able to make sense of the currently available data. There are several steps to getting good data.

Medical event data

First, you have to be able to collect data from multiple sources, clean it, and aggregate it based on standard criteria. Each data record could include the following elements:

  1. Event (what was counted, e.g., tests administered, positive test results, negative results, hospital admissions, ICU status, ventilation status, discharges, recoveries, deaths, etc.)
  2. Location ID (where the event occurred, see below)
  3. Date of incidence (when the event occurred)
  4. Date of reporting (sometimes data is reported days or even months after the event and can be updated many times as errors are corrected or missing data is estimated)
  5. Value (a count)
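The five elements above can be sketched as a simple record type. This is only an illustration; the field names and the location ID scheme are hypothetical, not taken from any of the datasets below.

```python
from dataclasses import dataclass
from datetime import date

# One medical-event record with the five elements listed above.
# Field names and the location ID scheme are illustrative only.
@dataclass
class MedicalEvent:
    event: str               # e.g., "positive_test", "hospital_admission", "death"
    location_id: str         # key into a location table (see below)
    date_of_incidence: date  # when the event occurred
    date_of_reporting: date  # when it was reported (may lag by days or months)
    value: int               # the count

record = MedicalEvent(
    event="positive_test",
    location_id="US-WA-033",  # hypothetical ID, e.g., a county
    date_of_incidence=date(2020, 3, 1),
    date_of_reporting=date(2020, 3, 4),
    value=12,
)
```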

The best repository of Covid-19 data is maintained by the New York Times (on GitHub) with an interactive viewer. Johns Hopkins University Coronavirus Resource Center also has a dataset. The best source for counts of tests in the U.S. is available from the Covid Tracking Project sponsored by the Atlantic.


One of several graphics available from the New York Times

Public policy change data

In addition to medical events, there are public policy events that can be tracked, such as government orders to close nonessential businesses, travel restrictions, and so forth. These records could include the following elements:

  1. Event (what type of public policy change was made)
  2. Location ID (where the change applies to, see below)
  3. Date of incidence (when the change was implemented)
  4. Date of reporting (when change was reported, usually before the change is implemented)

Unfortunately, I could not find a centralized source of information on government restrictions and the dates they became effective. A different source of information that can help indicate how much contact there is between people is the amount of movement by people who carry smartphones. Smartphones contain a GPS receiver and can report their position, which can be used to infer what type of activity the person is engaging in. Google Health publishes a community mobility report that is updated regularly. An example report is shown below, and the data is available for download in .csv format.


Among those who own Android smartphones and participate in tracking, trips have declined. Screenshot from Google Health

Demographic and geographic data

To analyze the data, you will want to append demographic and geographic data about the locations. Unlike events, demographic and geographic data changes slowly, so it only needs to be collected once during the model-building process. The following data elements could be useful in preparing a forecast model:

  1. Location ID (from above)
  2. Name or description
  3. Location hierarchy (continent > country > region > state > county > city > zip code, etc.)
  4. Latitude and longitude of centroid
  5. Latitude and longitude of center of largest city
  6. Surface area (km²)
  7. Total population
  8. Age distribution
  9. Gender distribution
  10. Income distribution
  11. Race distribution
  12. Political party affiliation distribution
  13. Health insurance coverage distribution
  14. Comorbidity distribution (smoking, diabetes, etc.)
  15. Number of hospitals
  16. Number of hospital beds
  17. Number of ICU beds
  18. Number of ventilators
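As a sketch of how events and location attributes come together, the snippet below joins hypothetical event counts to a location table on Location ID using pandas (assuming it is installed) and normalizes counts per 100,000 population. All names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical event counts keyed by Location ID
events = pd.DataFrame({
    "location_id": ["US-WA", "US-NY", "US-WA"],
    "event": ["death", "death", "positive_test"],
    "value": [2, 9, 40],
})

# Hypothetical slowly-changing location attributes (elements 6 and 7 above)
locations = pd.DataFrame({
    "location_id": ["US-WA", "US-NY"],
    "surface_area_km2": [184661, 141297],
    "total_population": [7_615_000, 19_450_000],
})

# Append the location attributes to each event, then normalize per 100,000
merged = events.merge(locations, on="location_id", how="left")
merged["per_100k"] = merged["value"] / merged["total_population"] * 100_000
```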

Some good sources for this type of data are US Census, United Nations Demographic Year Book, United Nations Development Programme’s (UNDP) Human Development Report and the World Bank’s World Development Report, Gapminder, and ESRI.

Visualize the data

Once the data is aggregated, there are many ways to visualize it. Maps are an obvious way to display location data, and line charts are an obvious way to display time series data. Domo, a developer of business intelligence software, has a very nice animation that displays time series data on a map (screenshot at top of blog).

Two caveats about their display. First, the number of cases is underreported because testing for infection was not widespread early in the pandemic, and is still too low today.

Second, outside the U.S. the data is reported by country, not by state or other smaller region. A single marker is used to represent the location of events. This is probably fine for Europe or Africa, where countries tend to be small. However, it is misleading for larger countries like Canada, Russia, China, Indonesia, Australia, and Brazil. Even the data for a state like California is distorted, because one would expect separate markers for the Bay Area and the LA Basin instead of a single one in the middle of the state.

Johns Hopkins Center for Systems Science and Engineering has produced a nice dashboard hosted on ArcGIS (screenshot below). It does a better job of dividing large countries into smaller geographic partitions, but the colors are dark. A description of the project was published in Lancet Infect Dis (Feb 2020) and in a press release (Jan 2020). All of the data and the dashboard are available in a GitHub repository.


Another example of a Covid-19 map. Screenshot from ArcGIS

A note about line charts. You often see Covid-19 growth charts by country that display time (either calendar date or days since the nth event occurred) on the horizontal axis and count on the vertical axis, both scaled linearly. I find these charts hard to interpret and compare. A better way to display growth data is to plot the logarithm of counts per 100,000 population on the vertical axis and days since the n*(population/100,000)th event occurred on the horizontal axis. Even better would be to divide large countries into smaller regions so that all the charts cover regions with similar populations.
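That alignment can be sketched in a few lines of Python with synthetic counts: each region's curve starts on the day it first reached n events per 100,000 people, and the values are log10 counts per 100,000.

```python
import math

# Cumulative case counts by day for two hypothetical regions (synthetic data)
cases = {
    "region_a": [1, 3, 9, 27, 81, 243],
    "region_b": [2, 4, 8, 16, 32, 64],
}
population = {"region_a": 500_000, "region_b": 2_000_000}

def aligned_log_curve(counts, pop, n_per_100k=1.0):
    """Return log10(counts per 100,000) starting from the day the
    region first reached n_per_100k events per 100,000 people."""
    threshold = n_per_100k * pop / 100_000
    start = next(i for i, c in enumerate(counts) if c >= threshold)
    return [math.log10(c / pop * 100_000) for c in counts[start:]]

curves = {r: aligned_log_curve(cases[r], population[r]) for r in cases}
```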

Making Forecasts

There are many groups making forecasts of Covid-19 infection rates and death rates. The CDC has a summary of them along with its own ensemble forecast, which predicts under 100,000 deaths in the U.S. by the end of May. The Institute for Health Metrics and Evaluation (IHME) predicts about 72,000 total deaths by the end of May, with a range from 60,000 to 115,000. You can download the data from the Global Health Data Exchange.

In addition to forecasting deaths, the IHME forecasts hospital utilization. These forecasts are used by hospitals to schedule resources and plan for peak usage.


Individual forecasts of cumulative reported deaths in U.S. from Covid-19 (left) and CDC ensemble forecast (right). Image from CDC


Cumulative death forecast in U.S. Image from IHME.

One of the best forecasts I have seen was produced by the Economist. It synthesizes data from US Census, New York Times, Covid Tracking Project, IHME, Google Health, and Unacast. The choropleth map of the U.S. below shows risk factors for Covid-19 mortality at the county level. Green shows areas where the risk level is low (less than 1%) and red shows high (6% or above).


Dixie in the crosshairs. Image from Economist

* * * *

Update1: In just one day, the IHME forecast is obsolete. See my response at

Update2: Added link to New York Times dataset and interactive viewer


Keeping my blog alive

by George Taniwaki

I’ve been writing this blog since 2007. For most of that time I’ve been using a text editor called Windows Live Writer. It was part of a bundle of free apps called Windows Live Essentials that Microsoft distributed to enhance the value of Windows and the .NET Framework. The last upgrade was released in 2012 and the product was discontinued a few years later.

I just bought a new PC and could not install Windows Live Writer on it. I was somewhat concerned about how to continue editing this blog. I suppose I could learn to use the new online WordPress editor, which uses a format and editing technique called blocks.

However, I’m old and set in my ways. Learning yet another text editor seems like a lot of work. Plus, there doesn’t seem to be a way to convert my existing .wpost files to the new WordPress format. And I prefer using a dedicated client app to an online browser app, even if the latter is performant.

Luckily, there are a lot of people like me. I found Open Live Writer. As stated in Wikipedia, Open Live Writer is a free and open-source version of Windows Live Writer. It is supported by the .NET Foundation. And the installer works on my new PC. Yay.


Folding a square sheet of paper

by George Taniwaki

After scribbling notes on a square yellow Post-it note, I folded it in half and was about to put it in my pocket when I made an interesting observation. If I fold a square paper exactly in half while holding an edge toward me, the resulting visible area is a rectangle with an area of one-half the total area (top illustration above). The same is true if I rotate the sheet 45 degrees and fold it along the diagonal to make a triangle (middle illustration above). However, if I fold it in half along any other angle, the area of the top layer is exactly one-half, but two additional triangular ears are visible from under the fold (bottom illustration above).

That made me wonder. What angle produces the maximum area, and what is that area? This is the perfect math question for the origami enthusiast.

Let’s solve it.

Finding the relevant dimensions

The first step in solving this problem is to unfold the sheet, use geometry to find any symmetries, and use trigonometry to calculate the area of the two triangular ears as a function of the rotation angle, θ.


Unfolding the sheet

By unfolding the sheet, I discover a surprising symmetry. The two triangular ears are identical. Let’s call the lengths of the two exterior sides A and B. The area of one triangular ear is A * B / 2 and the area of both ears is A * B.

The length of the interior side, which is the hypotenuse of a right triangle, can be called C. The angles opposite each side can be labeled a, b, and c. (See figure above.)

Using trigonometry, this means A = C * cos(b) and B = C * sin(b).

The angle b is twice the rotation angle or 2θ. Substituting gives

A = C * cos(2θ) and B = C * sin(2θ).

The length of one side of the square = 1, so A + B + C = 1. Substituting and rearranging gives

C * cos(2θ) + C * sin(2θ) + C = 1


C = 1/(cos(2θ) + sin(2θ) + 1)

and finally substituting this value of C in our original equations for A, B, and A * B gives

A = cos(2θ) / (cos(2θ) + sin(2θ) + 1)
B = sin(2θ) / (cos(2θ) + sin(2θ) + 1)
A * B = cos(2θ) * sin(2θ) / (cos(2θ) + sin(2θ) + 1)²

Numeric solution

There are two ways to find the maximum value of the area A * B and the angle θ: numerically and analytically. Let’s start with the numeric solution. In Excel, I create a table of values of θ from 0 to 45 degrees in 1-degree increments. I convert degrees to radians and calculate the associated values of A, B, and A * B. Then I plot them. The maximum value occurs at θ somewhere between 22 and 23 degrees. Thus, it seems 22.5 degrees is the solution.

At 22.5 degrees, A = B = 0.2929 and A * B = 0.08579.
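The Excel table can be reproduced in a few lines of Python (a sketch; it scans in half-degree steps rather than whole degrees):

```python
import math

def ear_area(theta_deg):
    """Area A * B of the two ears for a unit square folded at angle theta."""
    t = math.radians(2 * theta_deg)
    d = math.cos(t) + math.sin(t) + 1
    return math.cos(t) * math.sin(t) / d**2

# Scan theta from 0 to 45 degrees in half-degree steps
thetas = [th / 2 for th in range(0, 91)]
best_theta = max(thetas, key=ear_area)
best = ear_area(best_theta)
```

The scan lands on θ = 22.5° with A * B ≈ 0.08579, matching the table.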


Plot of A, B, and A*B versus θ

Analytic solution

Now let’s prove the numeric solution is correct by using an analytical method. To find the maximum area, I take the derivative of A * B with respect to θ and set to zero. Then solve for θ. The solution uses basic calculus, including a combination of the product rule and chain rule. First, I define A * B as a group of functions as follows:

let f(θ) = cos(2θ)
g(θ) = sin(2θ)
h(θ) = 1 / (cos(2θ) + sin(2θ) + 1)² = (cos(2θ) + sin(2θ) + 1)⁻²


A * B = f(θ) * g(θ) * h(θ)

The derivatives are as follows:

f'(θ) = -2 * sin(2θ)
g'(θ) = 2 * cos(2θ)
h'(θ) = -2 * (cos(2θ) + sin(2θ) + 1)⁻³ * (-sin(2θ) + cos(2θ)) * 2 = 4 * (sin(2θ) – cos(2θ)) / (cos(2θ) + sin(2θ) + 1)³


(A * B)’ = ( f'(θ) * g(θ) * h(θ)) + (g'(θ) * f(θ) * h(θ)) + (h'(θ) * f(θ) * g(θ))

Now I substitute the values of the 3 functions and their derivatives into the last equation.

(A * B)’ = (-2 * sin(2θ) * sin(2θ) * h(θ)) + (2 * cos(2θ) * cos(2θ) * h(θ)) + (4 * (sin(2θ) – cos(2θ)) / (cos(2θ) + sin(2θ) + 1)³ * cos(2θ) * sin(2θ))

Simplifying and rearranging gets:

(A * B)’ = (cos²(2θ) – sin²(2θ)) * 2 * h(θ) + (sin(2θ) – cos(2θ)) * 4 * cos(2θ) * sin(2θ) / (cos(2θ) + sin(2θ) + 1)³

The solutions are the values of θ where (A * B)’ = 0. Since cos²(2θ) – sin²(2θ) = (cos(2θ) – sin(2θ)) * (cos(2θ) + sin(2θ)), both terms share the factor cos(2θ) – sin(2θ), so the derivative is zero when cos(2θ) = sin(2θ). For 0 < θ < 45°, the only angle where this condition is met is 2θ = 45°, or θ = 22.5°. This confirms that the value I found numerically gives the maximum area.
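As a sanity check (not part of the proof), a finite-difference approximation confirms that the derivative vanishes at θ = 22.5° and that the curvature there is negative, i.e., a maximum:

```python
import math

def area(theta):
    """A * B as a function of the rotation angle theta (radians)."""
    d = math.cos(2 * theta) + math.sin(2 * theta) + 1
    return math.cos(2 * theta) * math.sin(2 * theta) / d**2

theta0 = math.radians(22.5)
h = 1e-5
# Central difference approximates (A * B)'(theta0); it should be ~0
deriv = (area(theta0 + h) - area(theta0 - h)) / (2 * h)
# Second central difference approximates the curvature; negative at a maximum
second = (area(theta0 + h) - 2 * area(theta0) + area(theta0 - h)) / h**2
```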

[Update: Completed the analytic solution section.]


by George Taniwaki

This post describes the final six classes I took to obtain the Microsoft Artificial Intelligence Certificate. Four were required and two were optional. For a description of the first six classes I took, see this May 2019 blog post.

DAT236x – Deep Learning Explained

Deep learning is the use of machine learning on large datasets, often using neural networks. It is used in fields such as computer vision, speech recognition, and language processing (topics covered in more detail in later classes). Techniques include logistic regression, multilayer perceptron, convolutional neural networks, recurrent neural networks, and long short-term memory.

Time: 10 hours on 6 modules

Score: Missed 4 homework questions and 2 knowledge check questions for a score of 92%

DAT236x Score  DAT236x Certificate

DAT257x – Reinforcement Learning Explained

Reinforcement learning assumes a problem can be modeled as a Markov decision process. There is a set of discrete states (S), an agent that can perform a set of actions (A), and a set of decision policies (Π) for choosing among them. Each possible action results in a reward (R) and a new state (S’). The goal is to find the optimum policy π(s) for all s in S, or to determine whether a given policy is optimal.

Solutions to the reinforcement learning problem include use of multi-arm bandits, regret minimization, dynamic programming (Bellman equation), policy evaluation and optimization, linear function approximation, deep neural networks, and deep Q-learning. Advanced topics include likelihood ratio methods, variance reduction, and actor-critic.
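As a tiny illustration of the multi-arm bandit setting mentioned above, here is an epsilon-greedy agent in plain Python. The arm probabilities, epsilon, and step count are arbitrary choices for the sketch, not from the course.

```python
import random

# Minimal epsilon-greedy agent for a 3-armed Bernoulli bandit.
def run_bandit(true_probs, steps=5000, epsilon=0.1, seed=123):
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms    # number of pulls per arm
    values = [0.0] * n_arms  # running estimate of each arm's mean reward
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
    return counts, values

counts, values = run_bandit([0.2, 0.5, 0.8])
```

After a few thousand steps the agent concentrates its pulls on the best arm (probability 0.8) while its value estimates converge toward the true probabilities.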

I have an undergraduate degree in chemical engineering where I learned about control theory and Markov chains. However, that coursework only covered analog PID controllers. The topics in this class were new to me and so it was slow going.

Time: 16 hours for 10 modules

Score: Missed 7 knowledge check questions early on, then slowed down and got the rest right. Missed 3 lab questions. Final score of 91%

DAT257x Score  DAT257x Certificate

DEV287x – Speech Recognition Systems

Speech recognition is an interdisciplinary activity that combines signal processing, acoustics, linguistics, and domain knowledge with computer science. The topics covered in this course include:

  1. Fundamental theory – Phonetics, words and syntax, performance metrics
  2. Speech signal processing – Feature extraction, mel filtering, log compression, Feature normalization
  3. Acoustic modeling – Markov chains, feedforward deep neural networks, sequence based objective function
  4. Language modeling – N-gram models, language model evaluation (likelihood, entropy, perplexity), LM operations (n-gram pruning, interpolating probabilities, merging), class-based LMs, neural network LMs
  5. Speech decoding – Weighted finite state transducers, WFST and acceptors, graph composition
  6. Advanced topics – Improved objective functions, sequential objective function, sequence discriminative objective functions

This class was pretty awful and I’m glad I didn’t pay for it. It consists mostly of text displayed in the edX courseware. It would have been helpful if video or audio lectures were included to show voice recognition in action. The text itself was split into multiple web pages containing embedded MathML equations, making it unsearchable. I ended up copying and pasting all the text into a Word document.

The lab assignments in this class are provided as Python files designed for Linux. Some labs require a Linux shell and would not run in Visual Studio on Windows. I would expect instructors in a Microsoft-sponsored course to design lessons that run on Windows. Simply putting the code in Jupyter notebooks would have made it easier to read and work with.

Some labs require a Python package called OpenFST that does not compile with the latest build tools available from Microsoft. Again, I would expect instructors to design lessons that could run on Windows.

Time: 6 hours for 6 modules

Score: None, I did not take this class for credit

DEV287x Score

DEV288x – Natural Language Processing (NLP)

Natural language processing consists of many separate but related tasks. These include transcription, translation, conversation, and image captioning.

Machine translation has evolved from conventional statistical machine translation (SMT), which uses hand-coded phrase pairs, to neural machine translation, which uses deep neural networks to create end-to-end sequence probabilities and translate entire sentences at a time.

The deep semantic similarity model (DSSM) is a DNN model for representing text strings as vectors. It can be used for information retrieval and entity ranking tasks.

Natural language understanding requires spoken language processing, continuous word representations, neural knowledgebase embedding, and KB-based question answering. NLP can be enhanced using deep reinforcement learning.

Finally, image captioning requires multimodal intelligence, combining image recognition and assigning labels to images in a natural language format.

Time: 10 hours for 6 modules

Score: None, I did not take this class for credit

DEV288x Score

DEV290x – Computer Vision and Image Analysis

This course is an excellent overview of the state-of-the-art in computer vision. It starts with a description of classical methods including thresholding, clustering, region growing, template matching, and feature detection (Sobel edges and Harris corners).

Next it covers object classification and detection algorithms such as Viola-Jones, histogram of oriented gradients (HOG), deep learning, extending classifiers into detectors, object proposals, and introduces convolutional neural networks (CNN).

Finally, the course introduces advanced topics such as super-pixels and conditional random fields, deep segmentation, and transfer learning.

Time: 10 hours for 20 modules

Score: Missed 4 quiz questions and 1 final exam question for a score of 94%

DEV290x Score  DEV290x Certificate

DAT264x – Microsoft Professional Capstone : Artificial Intelligence

This is the last class for the certificate. Similar to the capstone for the Microsoft Data Science certificate, it is a month-long project designed as a contest. Unlike the capstone class for the Data Science certificate, there is no report; the grade is based solely on the contest score. For a description of the April 2019 contest, see this [future date] blog post.

I used Microsoft’s Cognitive Toolkit (CNTK) package for Python for my solution. I had a hard time debugging my code. CNTK is not as popular as Google’s TensorFlow, so searching error messages on the web gives few results.

Time: 30 hours for single assignment

Score: Log-loss error of 0.22 for a final score of 95%

DAT264x Score  DAT264x Certificate

Final Certificate

Below is my certificate of completion for the Microsoft Professional Program, Artificial Intelligence Certificate.


* * * *

As an aside, starting in January 2019, edX has changed the way it handles students who audit courses. To encourage more students to pay for its courses, edX now limits access to course content to 30 days after enrollment. After 30 days, you lose access, even if you have posted items on the discussion board. Further, it eliminated access to the assessment content (quizzes, labs, and exams) entirely unless you pay. This sucks.

I hope Microsoft will provide funding to edX to allow audit students to participate. Or Microsoft should stop working with edX and move its content to another MOOC platform that supports audit students. I’ve paid over $2000 to participate in the edX courses. But I always audit a course before paying for the content. I think the try-before-you-buy model is essential to get students to trust they will get their money’s worth. Preventing audit students from seeing the assessment content will make it difficult to gain their trust in the value of that content.


by George Taniwaki

After completing the requirements for the Microsoft Data Science Certificate (see Jul 2017 blog post), I decided to continue my training and complete the requirements for the Microsoft Artificial Intelligence Certificate.

The AI certificate is similar to the Data Science certificate. It consists of ten courses with content produced by Microsoft and administered by edX. However, unlike the data science courses, none of the AI course assignments use the drag-and-drop Azure Machine Learning interface. Instead, most projects require Python programming ability. A summary of my progress in the first six classes is shown below.

DAT 263x – Introduction to Artificial Intelligence (AI)

This is a brief overview of artificial intelligence and a plug for the Azure services that support AI. It covers the following topics:

  1. Machine learning – Azure machine learning studio
  2. Language – text processing, natural language processing, Azure language understanding intelligent services (LUIS)
  3. Computer vision – Image processing, Azure face detection and recognition, Azure video indexer
  4. Conversation – Microsoft Bot Framework, Cortana skills  (Text analytics API, Linguistic analysis API, Bing speech-to-text, text-to-speech, translation)
  5. Deep learning – Microsoft cognitive toolkit (CNTK), Azure Data science virtual machine

Oddly, none of the classes that follow use any of the Azure services introduced in this course. Instead, most rely on Python code contained in Jupyter notebooks.

Another quibble. I work at Microsoft (but not for Microsoft since I am a contractor) and most Azure subscriptions are not available to me. Who knows why Microsoft lets me create an account but then doesn’t give me access to resources. See screenshot below.


Time: 6 hours for 4 modules

Score: No missed question for score of 100%

DAT263x Score  DAT263x Certificate

DAT 208x – Introduction to Python for Data Science

This is a DataCamp course using an interactive window for quizzes and a timed final exam. (For my earlier experiences with DataCamp courses, see DAT209x in this Jul 2018 blog post.) The topics covered in this Python course are lists, functions and methods, flow control, installing packages, arrays using NumPy, graphing using MatPlotLib, and dataframes using Pandas.

Time: 12 hours for 20 modules

Score: 100% on the quizzes and labs. Missed 7 on final exam for combined score of 94%

DAT208x Score  DAT208x Certificate

DAT 256x – Essential Math for Machine Learning: Python Edition

This basic math course covers algebra, calculus, tensors, eigenvectors and eigenvalues, statistics, probability theory, sampling, and hypothesis testing. All lessons use Python in Jupyter notebooks.

Time: 8 hours for 4 modules

Score: Missed 1 question for score of 97%

DAT256x Score  DAT256x Certificate

DAT249x – Ethics and Law in Data and Analytics

This is a new course that is now required for both the data science certificate and the artificial intelligence certificate. It covers privacy (including GDPR), explainability (XAI), and power and trust (bias). The course is taught using the traditional legal framework called Issue, Rule, Application, Conclusion (IRAC).

Time: 5 hours for 4 modules

Score: No error in the labs and missed 2 questions on the final exam for 96%

DAT249x Score  DAT249x Certificate

DAT203.1x – Data Science Essentials

I took this class as part of my certificate for Data Science. See my Jul 2017 blog post for details.

DAT203.2x – Principles of Machine Learning

I took this class as part of my certificate for Data Science. See my Jul 2018 blog post for details.

For my results in the remaining classes, see Microsoft Artificial Intelligence Certificate-Part 2


by George Taniwaki

I have now completed all ten classes required to receive the Microsoft Data Science Certificate. As a newly minted data scientist, I am ready to dig into large datasets and make incredible (and if I am not careful, potentially unverifiable) predictions. A description of the four classes I took during the past two quarters is shown below. (For a list of the first six classes I took, see this Jul 2017 blog post.)

DAT209x – Programming with R for Data Science

This class is a continuation of DAT204x. Topics covered include functions and data structures, loops and flow control, working with vectors and matrices, reading and writing data files, reading SQL databases, manipulating data (i.e., merging, subsetting, filtering, introduction to grep and text functions, using date and time functions, aggregating, grouping, and summarizing), simulation, linear models, and graphics.

The final exam for this course used the same DataCamp-based timed format as in DAT204x. I didn’t do well on that test, but felt confident going in this time because I was prepared for it. Unfortunately, I ended up failing the test because I could not answer the first question and then could not navigate past it. After spending over half an hour trying to resolve the issue and contacting tech support (which promised a reply within 24 hours but never responded), I gave up. How annoying. Luckily, it didn’t matter since I already had enough points to pass the class. If I was susceptible to test anxiety, this would have been a traumatic experience.

Time: 12 hours for 12 modules

Score: 100% on all exercises (not graded) and labs. Missed 1 quiz question because my instance of the ggplot2 library behaved differently than the one used in class. Got zero on the final exam (see above) for a combined score of 79%

DAT209x Score   DAT209x Certificate

DAT203.2x – Principles of Machine Learning

This class is a continuation of DAT203.1x. It covers the theory and application (using Microsoft Azure Machine Learning) of popular classification models including logistic regression, boosted decision trees, neural networks, and support vector machines (SVM). The models are tuned (that is, optimized for accuracy on data outside the training set) using permutation, regularization, and cross-validation.

The class covers the two most popular continuous models, regression and random forest. Beyond prediction models, the class covers K-means clustering and the Matchbox recommender system developed by Microsoft Research.

Similar to my experience with DAT203.1x, I am really impressed with the power of MAML but a bit disappointed in the cookbook nature of the labs in the class.

The final exam was an assignment to create a prediction model for the number of minutes difference between the actual arrival time and the scheduled arrival time for commercial airline flights. The grade was based on quality of the predictions for 25 flights outside the training set. The final was graded with one point for each prediction that was within 10 minutes of the actual arrival time.

My model had a mean absolute error of about 8.5 (a bit high since the goal is to get within 10 minutes), but I got 22 out of 25 predictions within the allowed range (88%). I am guessing that with more engineering effort I could reduce my error size. For instance, I could have created a categorical variable for gate load that segmented flight arrivals into weekday morning (a busy time), weekday midday, weekday evening (also busy), weekend day, any night, and heavy traffic days before and after major holidays.

But I don’t think those improvements would help with the 3 cases that my model got wrong. They appear to be outliers that could not be predicted using the variables at hand. A better prediction would need to include variables that were not available in the dataset like weather at the arrival airport, airport construction status, and if any landing restrictions were in effect at the airport that day.

Time: 12 hours for 6 modules

Score: 100% on the 6 labs, 88% accuracy in final model, and 100% on 2 surveys, for a combined score of 95%

DAT203.2x Score     DAT203.2x Certificate

DAT203.3x – Applied Machine Learning

This class is a continuation of DAT203.2x. It consists of four distinct modules. The first introduces time series analysis, with emphasis on seasonal decomposition of time series using LOESS (STL) and transforming the data into a stationary process using autoregressive integrated moving average (ARIMA) models.

Next, the class covers spatial data analysis with interpolation using kernel density estimation and K-nearest neighbor. The data is modeled using spatial Poisson processes, variograms, and Kriging. The resulting output can be displayed using dot maps, bubble maps, heat maps, and choropleth maps. The analysis is done using R in a Jupyter notebook in MAML, as a stored procedure in SQL Server 2016 R Services, and as an R script in Power BI.

The third module covers text analysis. English text is processed using text normalization, removing stop words, and stemming. These methods may not be applicable in other languages or scripts. Once the text is clean, the text can be analyzed for word frequency, word importance, named entity recognition, text mining, and sentiment analysis. All of these techniques are used in natural language processing.
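The cleaning steps described above can be sketched in a few lines of Python. The stop-word list and suffix-stripping rules below are toy examples for illustration, not a real stemmer such as Porter's:

```python
import re

# Toy stop-word list; real pipelines use much longer ones
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    """Strip a common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = normalize("The earthquakes damaged buildings in the town")
```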

The final module introduces image processing and image analysis using Python routines in matplotlib. Example techniques include denoising using convolution with Gaussian blur or median filter and prewhitening using Gaussian noise. Python code is used to show how to resize and rotate images. Feature extraction is demonstrated using Sobel edge detection, segmentation, and Harris corner detection. The basic image morphology operators are introduced, including dilation, eroding, opening, and closing. The course also introduces the cognitive services APIs available in Azure portal and how to access them using Python and C#.
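As a minimal illustration of the denoising step mentioned above, here is a 3x3 median filter in pure Python (rather than the course's matplotlib routines) applied to a small synthetic grayscale image with one salt-noise pixel:

```python
import statistics

def median_filter(img):
    """Replace each interior pixel with the median of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # borders are left unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = statistics.median(window)
    return out

noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],  # 255 is a salt-noise pixel
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
clean = median_filter(noisy)
```

The median is robust to a single outlier in the window, so the 255 pixel is restored to 10 while the uniform background is untouched.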

Time: 12 hours for 4 modules

Score: 100% on the 4 labs and missed one question on the final exam for a combined score of 95%

DAT203.3x Score     DAT203.3x Certificate

DAT102x – Data Science Professional Capstone

This is the tenth and final course needed to receive the Microsoft Data Science Certificate. The class lasts one month and consists of a machine learning contest and report. The project changes every quarter. For a description of the April contest, see the 3-part blog entry that starts in June 2018.

Time: 10 hours for creating a machine learning model and writing the report

Score: Missed one question in the data analysis section (because the mean and standard deviation had to be reported to six decimal places and I only entered them to 2 decimals), scored 86% on my machine learning model, and 100% on my report graded by 3 class peers, for a final score of 91%

DAT102x Score     DAT102x Certificate

Final Certificate

And finally, below is the Data Science Certificate.



Town of Barpak after Gorkha earthquake. Image from The Telegraph (UK)


by George Taniwaki

This is the final set of my notes from a machine learning class offered by edX. Part 1 of this blog entry is posted in June 2018.

Step 7: Optimize model

At the end of step 6, I discovered that none of my three models met the minimum F score (at least 0.60) needed to pass the class. Starting with the configuration shown in Figure 5, I modified my experiment by replacing the static data split with partition and sampling using 10 evenly split folds. I used a random seed of 123 to ensure reproducibility.

I added both a cross-validation step and a hyperparameter tuning step to optimize results. To improve performance, I also added a Convert to indicator values module, which converts the categorical variables into dummy binary variables before running the model.
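Outside of MAML Studio, the partitioning and indicator-value steps can be sketched with pandas and scikit-learn (the toy records and column names below are hypothetical stand-ins for the building data, not the actual dataset):

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for the building dataset
df = pd.DataFrame({
    "age": [5, 30, 12, 80, 45, 7, 22, 60, 15, 33],
    "roof_type": ["tin", "tile", "tin", "thatch", "tile",
                  "tin", "thatch", "tile", "tin", "tile"],
    "damage_grade": [1, 2, 2, 3, 2, 1, 3, 3, 2, 2],
})

# Convert to indicator values: one dummy binary column per category
df = pd.get_dummies(df, columns=["roof_type"])

# Partition and sample: 10 evenly split folds, seeded for reproducibility
folds = KFold(n_splits=10, shuffle=True, random_state=123)
for train_idx, test_idx in folds.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
```

Fixing the random seed (123 above, as in the experiment) means the same fold assignment is produced on every run.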

Unfortunately, the MAML ordinal regression module does not support hyperparameter tuning. So I replaced it with the one-vs-all multiclass classifier. The new configuration is shown in Figure 6 below. (Much thanks to my classmate Robert Ritz for sharing his model.)

Figure 6. Layout of MAML Studio experiment with partitioning and hyperparameter tuning steps


For an explanation of how hyperparameter tuning works, see Microsoft documentation and MSDN blog post.

Model 5 – One-vs-all multiclass model using logistic regression classifier

In the earlier experiments, the two-class logistic regression classifier gave the best results. I will use it again with the one-vs-all multiclass model. The default parameter ranges for the two-class logistic regression classifier are: Optimization tolerance = 1E-4 or 1E-7; L1 regularization weight = 0.01, 0.1, or 1.0; L2 regularization weight = 0.01, 0.1, or 1.0; and memory size for L-BFGS = 5, 20, or 50.
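For readers without MAML Studio, a rough scikit-learn analogue of one-vs-all logistic regression with a tuning sweep looks like this. Note the mapping is only approximate: scikit-learn exposes a tolerance (tol) and a single inverse regularization strength (C) rather than separate L1/L2 weights and an L-BFGS memory size, and the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class dataset standing in for the building data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)

# One-vs-all wrapper around a two-class logistic regression base classifier.
# The grid is a loose analogue of MAML's default parameter ranges.
grid = {
    "estimator__tol": [1e-4, 1e-7],
    "estimator__C": [0.01, 0.1, 1.0],
}
search = GridSearchCV(OneVsRestClassifier(LogisticRegression(max_iter=1000)),
                      grid, cv=10, scoring="f1_weighted")
search.fit(X, y)
```

GridSearchCV evaluates every parameter combination on each of the 10 folds and keeps the combination with the best mean score, which is essentially what the MAML tuning module does.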

Table 12a. Truth table for one-vs-all multiclass model using logistic regression classifier

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 296 156 20 472
Predict 2 621 4633 1651 6905
Predict 3 21 847 1755 2623
TOTAL 938 5636 3426 10000


Table 12b. Performance measures for one-vs-all multiclass model using logistic regression classifier

Performance Value
Avg Accuracy 0.76
F1 Score 0.64
F1 Score (test data) Not submitted


The result is disappointing. The new model has an F1 score of 0.64, lower than the 0.68 achieved by the ordinal regression model using the logistic regression classifier.
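As a sanity check, the per-class F1 scores can be recomputed directly from the truth table in Table 12a. The weighted average below comes out near 0.65 rather than the reported 0.64, so MAML Studio presumably applies a slightly different averaging; the sketch is illustrative only:

```python
import numpy as np

# Truth table from Table 12a: rows = predicted class, columns = actual class
cm = np.array([[296,  156,   20],
               [621, 4633, 1651],
               [ 21,  847, 1755]])

precision = cm.diagonal() / cm.sum(axis=1)   # correct / total predicted per class
recall    = cm.diagonal() / cm.sum(axis=0)   # correct / total actual per class
f1 = 2 * precision * recall / (precision + recall)

# Weighted-average F1, weighting each class by its actual frequency
weights = cm.sum(axis=0) / cm.sum()
f1_weighted = (f1 * weights).sum()
```

The per-class numbers also show where the model struggles: damage grade 1 has a very low recall (only 296 of 938 actual grade-1 buildings are predicted correctly), which drags down the overall score.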

Model 6 – Add geo_level_2 to model

Originally, I excluded geo_level_2 from the model, even though its Chi-square test was significant, because it consumed too many degrees of freedom. I rerun the experiment including this variable, keeping all other variables and parameters the same.

Table 13a. Truth table for one-vs-all multiclass model using logistic regression classifier and including geo_level_2

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 355 218 27 600
Predict 2 564 4662 1446 6672
Predict 3 19 756 1953 2728
TOTAL 938 5636 3426 10000


Table 13b. Performance measures for one-vs-all multiclass model using logistic regression classifier and including geo_level_2

Performance Value
Avg Accuracy 0.80
F1 Score 0.70
F1 Score (test data) Not submitted


The resulting F1 score is 0.70, better than any prior experiment and meeting the 0.70 target exactly.

Model 7 – Add height/floor to the model

I will try to improve the model by adding a variable measuring height/floor. This variable is always positive, skewed toward zero, and has a long tail. To normalize it, I apply the natural log transform and name the new variable ln_height_per_floor. Table 14 and Figure 7 show the summary statistics.
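The transform itself is a one-liner; here is a minimal numpy sketch with hypothetical height and floor-count values (not the actual building data):

```python
import numpy as np

# Hypothetical building records: height in meters, pre-earthquake floor count
height = np.array([6.0, 9.0, 3.0, 12.0])
floors = np.array([2, 3, 2, 4])

# height/floor is positive and right-skewed, so take the natural log
ln_height_per_floor = np.log(height / floors)
```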

Table 14. Descriptive statistics for ln_height_per_floor

Variable name Min Median Max Mean Std dev
ln_height_per_floor -1.79 0.69 2.30 0.76 0.25


Figure 7. Histogram of ln_height_per_floor


I run the model again with no other changes.

Table 15a. Truth table for one-vs-all multiclass model using logistic regression classifier, including geo_level_2, height/floor

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 366 227 28 621
Predict 2 557 4640 1436 6633
Predict 3 15 769 1962 2746
TOTAL 938 5636 3426 10000


Table 15b. Performance measures for one-vs-all multiclass model using logistic regression classifier, including geo_level_2, height/floor

Performance Value
Avg Accuracy 0.80
F1 Score 0.70
F1 Score (test data) Not submitted


The accuracy of predicting damage_grade = 1 or 3 increases, but the accuracy for damage_grade = 2 decreases, resulting in no net change in average accuracy or F1 score.

Model 8 – Go back to ordinal regression

The accuracy of the one-vs-all multiclass model was significantly improved by adding geo_level_2. Let’s see what happens if I add geo_level_2 and height/floor to the ordinal regression model, which produced a higher F1 score than the one-vs-all model.

Table 16a. Truth table for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 80 59 1 140
Predict 2 227 1557 542 2326
Predict 3 3 246 585 834
TOTAL 310 1862 1128 3300


Table 16b. Performance measures for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor

Performance Value
Avg Accuracy 0.78
F1 Score 0.67
F1 Score (test data) Not submitted


Surprisingly, ordinal regression produces slightly worse results with the geo_level_2 variable included (F1 = 0.67) than without it (0.68).

Model 9 – Convert numeric to categorical

I spent a lot of effort adjusting and normalizing my numeric variables. They were mostly integer values with a small range and did not appear to be correlated with damage_grade. Could the model be improved by treating them as categorical? Let’s find out.

First, I perform a Chi-square test to confirm that all of the variables are significant. Then I run the model after converting the values from numeric to strings, so that all of these variables are treated as categorical.
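A Chi-square test of independence like this can be run with scipy.stats.chi2_contingency (the contingency table below is hypothetical, not the actual building data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = values of a numeric variable
# (e.g., count_floor_pre_eq = 1, 2, 3), columns = damage_grade 1..3
table = np.array([[120, 400, 180],
                  [ 80, 900, 520],
                  [ 30, 350, 420]])

# chi2: test statistic; p: p-value; dof: degrees of freedom;
# expected: cell counts expected under independence
chi2, p, dof, expected = chi2_contingency(table)
```

A small p-value (below the 0.05 significance level) indicates the variable and damage_grade are not independent, which is the criterion used in Table 17.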

Table 17. Chi-square results of numerical values to damage_grade

Variable name Chi-square Deg. of freedom P value
count_floor_pre_eq 495 14 < 2.2e-16*
height 367 37 < 2.2e-16*
age 690 60 < 2.2e-16*
area 738 314 < 2.2e-16*
count_families 76 14 1.3e-10*
count_superstructure 104 14 7.1e-16*
count_secondary_use 79 4 3.6e-16*

*One or more enums have sample sizes too small to use Chi-square approximation
[ ] P value greater than 0.05 significance level

Table 18a. Truth table for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor, and converting numeric to categorical

Truth table Is 1 Is 2 Is 3 TOTAL
Predict 1 83 62 3 148
Predict 2 224 1544 540 2308
Predict 3 3 256 585 844
TOTAL 310 1862 1128 3300


Table 18b. Performance measures for ordinal regression model using logistic regression classifier, including geo_level_2, height/floor, and converting numeric to categorical

Performance Value
Avg Accuracy 0.78
F1 Score 0.67
F1 Score (test data) Not submitted


Changing the integer variables to categorical has almost no impact on the F1 score.


Table 19 below summarizes all nine models I built. Six of them achieved an F1 score of 0.60 or higher on the training data, which would probably have been sufficient to pass the class. Two of them had an F1 score of 0.70, which would be a grade of 95 out of 100.

I was unable to run most of these models on the test dataset and submit the results to the data science capstone website. Thus, I do not know what my leaderboard F1 score would be. It is possible that I overfit my model to the training data and my leaderboard F1 score might be lower.

Finding the best combination of variables, models, and model hyperparameters is difficult to do manually. It took me several hours to build the nine models described in this blog post. Machine learning automation tools exist but are not yet robust, nor are they built into platforms like MAML Studio. (Much thanks again to Robert Ritz, who pointed me to TPOT, a Python-based tool for auto ML.)

Table 19. Summary of models. Green indicates differences from base case, model 2

Model | Variables | Algorithm | Data split | F1 score (training) | F1 score (test)
1 | None | Naïve guess = 2 | None | 0.56 | Not submitted
3 | 27 from Table 5 | Ordinal regression with decision forest | 0.67 split | 0.64 | Not submitted
4 | 27 from Table 5 | Ordinal regression with SVM | 0.67 split | 0.57 | 0.5644
2 | 27 from Table 5 | Ordinal regression with logistic regression | 0.67 split | 0.68 | 0.5687
5 | 27 from Table 5 | One-vs-all multiclass with logistic regression, hyperparameter tuning | 10-fold partition | 0.64 | Not submitted
6 | 27 from Table 5, geo_level_2 | One-vs-all multiclass with logistic regression, hyperparameter tuning | 10-fold partition | 0.70 | Not submitted
7 | 27 from Table 5, geo_level_2, height/floor | One-vs-all multiclass with logistic regression, hyperparameter tuning | 10-fold partition | 0.70 | Not submitted
8 | 27 from Table 5, geo_level_2, height/floor | Ordinal regression with logistic regression | 0.67 split | 0.67 | Not submitted
9 | 27 from Table 5, convert numeric to categorical, geo_level_2, height/floor | Ordinal regression with logistic regression | 0.67 split | 0.67 | Not submitted