MAML

by George Taniwaki

I have now completed all ten classes required to receive the Microsoft Data Science Certificate. As a newly minted data scientist, I am ready to dig into large datasets and make incredible (and if I am not careful, potentially unverifiable) predictions. A description of the four I took during the past two quarters is shown below. (For a list of the first six classes I took, see this Jul 2017 blog post.)

DAT209x – Programming with R for Data Science

This class is a continuation of DAT204x. Topics covered include functions and data structures, loops and flow control, working with vectors and matrices, reading and writing data files, reading SQL databases, manipulating data (i.e., merging, subsetting, filtering, introduction to grep and text functions, using date and time functions, aggregating, grouping, and summarizing), simulation, linear models, and graphics.

The final exam for this course used the same DataCamp-based timed format as in DAT204x. I didn’t do well on that test, but felt confident going in this time because I was prepared for it. Unfortunately, I ended up failing the test because I could not answer the first question and then could not navigate past it. After spending over half an hour trying to resolve the issue and contacting tech support (which promised a reply within 24 hours but never responded), I gave up. How annoying. Luckily, it didn’t matter since I already had enough points to pass the class. If I was susceptible to test anxiety, this would have been a traumatic experience.

Time: 12 hours for 12 modules

Score: 100% on all exercises (not graded) and labs. Missed 1 quiz question because my instance of the ggplot2 library behaved differently than the one used in class. Got zero on the final exam (see above) for a combined score of 79%

DAT209x Score   DAT209x Certificate

DAT203.2x – Principles of Machine Learning

This class is a continuation of DAT203.1x. It covers the the theory and application (using Microsoft Azure Machine Learning) of popular classification models including logistic regression, boosted decision trees, neural networks, and support vector machines (SVM). The models are tuned (that is, optimized for accuracy on the data outside the training set) using permutation, regularization, and cross-validation, which are all ensemble learning methods.

The class covers the two most popular continuous models, regression and random forest. Beyond prediction models, the class covers K-mean clusters and the matchbox recommender system developed by Microsoft Research.

Similar to my experience with DAT203.1x, I am really impressed with the power of MAML but a bit disappointed in the cookbook nature of the labs in the class.

The final exam was an assignment to create a prediction model for the number of minutes difference between the actual arrival time and the scheduled arrival time for commercial airline flights. The grade was based on quality of the predictions for 25 flights outside the training set. The final was graded with one point for each prediction that was within 10 minutes of the actual arrival time.

My model had a mean absolute error of about 8.5 (a bit high since the goal is to get within 10 minutes), but I got 22 out of 25 predictions within the allowed range (88%). I am guessing that with more engineering effort I could reduce my error size. For instance, I could have created a categorical variable for gate load that segmented flight arrivals into weekday morning (a busy time), weekday midday, weekday evening (also busy), weekend day, any night, and heavy traffic days before and after major holidays.

But I don’t think those improvements would help with the 3 cases that my model got wrong. They appear to be outliers that could not be predicted using the variables at hand. A better prediction would need to include variables that were not available in the dataset like weather at the arrival airport, airport construction status, and if any landing restrictions were in effect at the airport that day.

Time: 12 hours for 6 modules

Score: 100% on the 6 labs, 88% accuracy in final model, and 100% on 2 surveys, for a combined score of 95%

DAT203.2x Score     DAT203.2x Certificate

DAT203.3x – Applied Machine Learning

This class is a continuation of DAT203.2x. It consists of four distinct modules. The first introduces time series analysis with emphasis on seasonal decomposition of time series using LOESS (STL) and transforming the data into a stationary process using autoreressive integrated moving average (ARIMA).

Next, the class covers spatial data analysis with interpolation using kernel density estimation and K-nearest neighbor. The data is modeled using spatial Poisson process, variograms, and Kriging. The resulting output can be displayed using dot maps, bubble maps, heat maps, and chloropleth maps. The analysis is done using R in both a Jupyter notebook in MAML and as a stored procedure in SQL Server 2016 R services and R script in Power BI .

The third module covers text analysis. English text is processed using text normalization, removing stop words, and stemming. These methods may not be applicable in other languages or scripts. Once the text is clean, the text can be analyzed for word frequency, word importance, named entity recognition, text mining, and sentiment analysis. All of these techniques are used in natural language processing.

The final module introduces image processing and image analysis using Python routines in matplotlib. Example techniques include denoising using convolution with Gaussian blur or median filter and prewhitening using Gaussian noise. Python code is used to show how to resize and rotate images. Feature extraction is demonstrated using Sobel edge detection, segmentation, and Harris corner detection. The basic image morphology operators are introduced, including dilation, eroding, opening, and closing. The course also introduces the cognitive services APIs available in Azure portal and how to access them using Python and C#.

Time: 12 hours for 4 modules

Score: 100% on the 4 labs and missed one question on the final exam for a combined score of 95%

DAT203.3x Score     DAT203.3x Certificate

DAT102x – Data Science Professional Capstone

This is the tenth and final course needed to receive the Microsoft Data Science Certificate. The class lasts one month and consists of a machine learning contest and report. The project changes every quarter. For a description of the April contest, see the 3 part blog entry that starts  June 2018.

Time: 10 hours for creating a machine learning model and writing the report

Score: Missed one question in the data analysis section (because the mean and standard deviation had to be reported to six decimal places and I only entered them to 2 decimals), scored 86% on my machine learning model, and 100% on my report graded by 3 class peers, for a final score of 91%

DAT102x Score     DAT102x Certificate

Final Certificate

And finally, below is the Data Science Certificate.

DataScienceCertificate

MsftBigData

by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist. And which software stack should one specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to do take an online class developed and sponsored by Microsoft and edX. Completion of the course leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about half way done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a t-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced mooc. This course starts with the basics like select, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and on an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called get and transform data) and M language and Power Pivot and DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes rule, random variables, central limit theorem, sampling and confidence interval, and hypothesis testing. Most analysis is conducted using the Data analysis pack add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed 9 questions on the quizzes (88%) and six in the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, it goes to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure subject mastery by a beginning programmer. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

DAT204x Score     DAT204x Certificate

DAT203.1x Data Science Essentials

The first three modules in this course covered statistics and was mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate it using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six out of the ten courses required for a certificate. I expect to finish the remaining 4 needed for a certificate by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.

Dante Chinni and James Gimpel, the authors of Our Patchwork Nation: The Surprising Truth About the “Real” America, look at demographic data of the United States at the county level. There are 3,141 counties in the U.S. and though they vary considerably in geographical size and population, using county-level maps to display data provides a convenient way to compare and contrast demographic data.

By mapping demographic data at the county level, you can see how attributes like population density, income, education, attitudes, behaviors, and health are distributed across the U.S.

Unfortunately, the authors take the wonderfully detailed data available from various sources at the county level and use segmentation analysis to group the counties into twelve categories and give them cute names. Ugh.

OurPatchworkNation

Our Patchwork Nation. Image from Amazon

Luckily, the authors provide access to the raw county-level data at their website, patchworknation.org. You can view county level chloropleth maps for a wide variety of data. There is even a tool to overlay two maps to do comparisons. The tool doesn’t work very well since you cannot select the colors of the overlays. But overall, the patchworknation site has some of the best U.S. data maps available.

An example of the problem using segments rather than the raw data is illustrated in an article that appears in The Atlantic Apr 2010. The map shown below shows the 12 segments. But the user has to flip back and forth between the legend and the map to determine what each color means.

12States

The 12 states of America. Image from PatchworkNation

The raw data, at the Patchwork Nation website is shown in the three maps below. The first map shows the median household income in 1980. The colors show which quintile each county falls into. The second map shows quintiles for 2010. It is hard to compare the two maps to see how the distribution changes. Going to the Patchwork Nation website and toggling between the two maps makes it easier.

12StatesIncome1980

Distribution of median household income by county in 1980. From Patchwork Nation

12StatesIncome2010

Distribution of median household income by county in 2010. From Patchwork Nation

The map below shows the change in median income for each county using 2010 adjusted dollars. Notice the counties with large and growing urban populations, mostly on the east and west coasts. They had the highest median incomes in 1980 and in 2010. They also show the highest growth in median income while the remaining counties show smaller gains or a loss. The gap in income distribution in the U.S. is growing.

For more on this story, read the article at the Patchwork Nation website.

12StatesIncomeGrowth1980-2010

Change in median household income by county from 1980 to 2010. From Patchwork Nation