by George Taniwaki

I have now completed all ten classes required to receive the Microsoft Data Science Certificate. As a newly minted data scientist, I am ready to dig into large datasets and make incredible (and if I am not careful, potentially unverifiable) predictions. A description of the four I took during the past two quarters is shown below. (For a list of the first six classes I took, see this Jul 2017 blog post.)

DAT209x – Programming with R for Data Science

This class is a continuation of DAT204x. Topics covered include functions and data structures, loops and flow control, working with vectors and matrices, reading and writing data files, reading SQL databases, manipulating data (i.e., merging, subsetting, filtering, using grep and text functions, using date and time functions, aggregating, grouping, and summarizing), simulation, linear models, and graphics.

The final exam for this course used the same DataCamp-based timed format as in DAT204x. I didn’t do well on that test, but felt confident going in this time because I was prepared for it. Unfortunately, I ended up failing the test because I could not answer the first question and then could not navigate past it. After spending over half an hour trying to resolve the issue and contacting tech support (which promised a reply within 24 hours but never responded), I gave up. How annoying. Luckily, it didn’t matter since I already had enough points to pass the class. If I were susceptible to test anxiety, this would have been a traumatic experience.

Time: 12 hours for 12 modules

Score: 100% on all exercises (not graded) and labs. Missed 1 quiz question because my instance of the ggplot2 library behaved differently from the one used in class. Got zero on the final exam (see above) for a combined score of 79%.

DAT209x Score   DAT209x Certificate

DAT203.2x – Principles of Machine Learning

This class is a continuation of DAT203.1x. It covers the theory and application (using Microsoft Azure Machine Learning) of popular classification models including logistic regression, boosted decision trees, neural networks, and support vector machines (SVM). The models are tuned (that is, optimized for accuracy on data outside the training set) using permutation, regularization, and cross-validation.
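To illustrate the idea behind cross-validation, here is a minimal Python sketch of how k-fold splitting works. It is not the Azure Machine Learning implementation used in the class. The function name `k_fold_indices` is my own, and the sketch assumes the rows are already in random order, so it does no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Split row indices 0..n_samples-1 into k (train, test) pairs.

    Each row lands in exactly one test fold. Rows are NOT shuffled here,
    so this assumes the data is already in random order.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be fit on each `train` list and scored on the matching `test` list. Averaging the scores gives an estimate of accuracy on data outside the training set.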

The class covers the two most popular models for continuous outcomes, regression and random forest. Beyond prediction models, the class covers K-means clustering and the Matchbox recommender system developed by Microsoft Research.

Similar to my experience with DAT203.1x, I am really impressed with the power of MAML but a bit disappointed in the cookbook nature of the labs in the class.

The final exam was an assignment to create a prediction model for the number of minutes difference between the actual arrival time and the scheduled arrival time for commercial airline flights. The grade was based on quality of the predictions for 25 flights outside the training set. The final was graded with one point for each prediction that was within 10 minutes of the actual arrival time.
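The grading rule can be sketched in a few lines of Python. This is an illustration, not the grader used in the class. The delays below are made up, the function name `score_predictions` is my own, and I am assuming that a miss of exactly 10 minutes still counts as a hit. The sketch also reports the mean absolute error, a common summary of prediction quality.

```python
def score_predictions(predicted, actual, tolerance=10.0):
    """Return (hits, mean_absolute_error) for paired arrival delays in minutes.

    A hit is a prediction whose absolute error is at most `tolerance`
    (the inclusive boundary is an assumption, not stated in the course).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and the same length")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    hits = sum(1 for e in errors if e <= tolerance)
    mae = sum(errors) / len(errors)
    return hits, mae


# Made-up delays in minutes (negative means an early arrival).
predicted = [5.0, -3.0, 12.0, 40.0]
actual = [9.0, -1.0, 30.0, 35.0]
hits, mae = score_predictions(predicted, actual)
print(f"{hits} of {len(actual)} within tolerance, MAE = {mae:.2f} minutes")
```

Note that the two measures can disagree. A single large miss raises the mean absolute error a lot but costs only one point under the hit-count rule.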

My model had a mean absolute error of about 8.5 (a bit high since the goal is to get within 10 minutes), but I got 22 out of 25 predictions within the allowed range (88%). I am guessing that with more engineering effort I could reduce my error size. For instance, I could have created a categorical variable for gate load that segmented flight arrivals into weekday morning (a busy time), weekday midday, weekday evening (also busy), weekend day, any night, and heavy traffic days before and after major holidays.
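A gate-load variable like the one described above might be derived from the scheduled arrival time roughly as follows. This is a sketch I did not actually build. The hour cutoffs, the category labels, the `holiday_rush_dates` parameter, and the rule that holiday traffic takes priority over time of day are all assumptions chosen for illustration.

```python
from datetime import date, datetime


def gate_load_category(arrival, holiday_rush_dates=frozenset()):
    """Map a scheduled arrival datetime to a gate-load category.

    All hour boundaries are illustrative guesses, not values from the course.
    """
    if arrival.date() in holiday_rush_dates:
        return "holiday_rush"          # heavy traffic days around major holidays
    hour = arrival.hour
    if hour >= 22 or hour < 6:
        return "night"                 # any night, weekday or weekend
    if arrival.weekday() >= 5:         # Saturday = 5, Sunday = 6
        return "weekend_day"
    if hour < 10:
        return "weekday_morning"       # assumed busy period
    if hour < 16:
        return "weekday_midday"
    return "weekday_evening"           # assumed busy period


# Hypothetical holiday-rush calendar with a single date.
rush = frozenset({date(2018, 11, 21)})
print(gate_load_category(datetime(2018, 3, 5, 8, 30), rush))    # a Monday morning
print(gate_load_category(datetime(2018, 11, 21, 14, 0), rush))  # a listed rush day
```

The resulting string would then be fed to the model as a categorical feature in place of, or alongside, the raw arrival hour.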

But I don’t think those improvements would help with the 3 cases that my model got wrong. They appear to be outliers that could not be predicted using the variables at hand. A better prediction would need to include variables that were not available in the dataset like weather at the arrival airport, airport construction status, and if any landing restrictions were in effect at the airport that day.

Time: 12 hours for 6 modules

Score: 100% on the 6 labs, 88% accuracy in final model, and 100% on 2 surveys, for a combined score of 95%

DAT203.2x Score     DAT203.2x Certificate

DAT203.3x – Applied Machine Learning

This class is a continuation of DAT203.2x. It consists of four distinct modules. The first introduces time series analysis with an emphasis on seasonal decomposition of time series using LOESS (STL) and transforming the data into a stationary process using autoregressive integrated moving average (ARIMA).
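The "integrated" part of ARIMA refers to differencing, which is the step that removes trend so the remaining series is closer to stationary. The class did this in R. The Python sketch below only illustrates that one step on made-up numbers, and the function name `difference` is my own.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: series[i] - series[i - lag].

    lag=1 removes a linear trend; a lag equal to the season length
    (for example 12 for monthly data) removes a repeating seasonal level.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be at least 1 and shorter than the series")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Made-up series with a steady upward trend of 2 per step.
trend = [10, 12, 14, 16, 18]
print(difference(trend))

# Made-up series that repeats every 3 steps with no trend.
seasonal = [5, 9, 2, 5, 9, 2, 5]
print(difference(seasonal, lag=3))
```

After differencing, the trending series becomes constant and the seasonal series becomes all zeros, which is what leaves the autoregressive and moving average terms with a stable series to model.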

Next, the class covers spatial data analysis with interpolation using kernel density estimation and K-nearest neighbors. The data is modeled using a spatial Poisson process, variograms, and kriging. The resulting output can be displayed using dot maps, bubble maps, heat maps, and choropleth maps. The analysis is done using R in a Jupyter notebook in MAML, as a stored procedure in SQL Server 2016 R Services, and as an R script in Power BI.

The third module covers text analysis. English text is processed using text normalization, removing stop words, and stemming. These methods may not be applicable in other languages or scripts. Once the text is clean, the text can be analyzed for word frequency, word importance, named entity recognition, text mining, and sentiment analysis. All of these techniques are used in natural language processing.

The final module introduces image processing and image analysis using Python routines in matplotlib. Example techniques include denoising using convolution with Gaussian blur or median filter and prewhitening using Gaussian noise. Python code is used to show how to resize and rotate images. Feature extraction is demonstrated using Sobel edge detection, segmentation, and Harris corner detection. The basic image morphology operators are introduced, including dilation, erosion, opening, and closing. The course also introduces the cognitive services APIs available in the Azure portal and how to access them using Python and C#.
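To show what the morphology operators do, here is a small pure-Python sketch on a binary image stored as a list of lists. It is not the code from the class, which used library routines. It assumes a 3x3 square structuring element and treats pixels outside the image as background (0). Both are choices I made for illustration.

```python
def _neighborhood(image, r, c):
    """Values in the 3x3 window centered on (r, c); outside the image counts as 0."""
    rows, cols = len(image), len(image[0])
    return [
        image[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0
        for rr in range(r - 1, r + 2)
        for cc in range(c - 1, c + 2)
    ]


def dilate(image):
    """A pixel becomes 1 if any pixel in its 3x3 window is 1."""
    return [[max(_neighborhood(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def erode(image):
    """A pixel stays 1 only if every pixel in its 3x3 window is 1."""
    return [[min(_neighborhood(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def opening(image):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(image))


def closing(image):
    """Dilation followed by erosion: fills holes smaller than the window."""
    return erode(dilate(image))


# A 3x3 block plus one isolated noise pixel in the bottom-right corner.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
opened = opening(image)
for row in opened:
    print(row)
```

Opening keeps the 3x3 block intact and deletes the stray pixel, which is why it is often used to clean up noise after thresholding.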

Time: 12 hours for 4 modules

Score: 100% on the 4 labs and missed one question on the final exam for a combined score of 95%

DAT203.3x Score     DAT203.3x Certificate

DAT102x – Data Science Professional Capstone

This is the tenth and final course needed to receive the Microsoft Data Science Certificate. The class lasts one month and consists of a machine learning contest and report. The project changes every quarter. For a description of the April contest, see the 3-part blog entry that starts June 2018.

Time: 10 hours for creating a machine learning model and writing the report

Score: Missed one question in the data analysis section (because the mean and standard deviation had to be reported to six decimal places and I only entered them to two decimal places), scored 86% on my machine learning model, and 100% on my report graded by 3 class peers, for a final score of 91%.

DAT102x Score     DAT102x Certificate

Final Certificate

And finally, below is the Data Science Certificate.