by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist? And which software stack should you specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to take an online program developed and sponsored by Microsoft and edX. Completion of the program leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced at about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about halfway done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a T-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced MOOC. This course starts with the basics like SELECT statements, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and an Azure SQL database.
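To give a flavor of the kind of basic query mentioned above, here is a minimal sketch of a SELECT with a subquery. It is not taken from the course materials. It assumes Python's built-in sqlite3 module as a stand-in for SQL Server (SQLite's dialect differs from T-SQL in many details), and the orders table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; SQLite stands in for SQL Server here (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for this example.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Ann", 120.0), ("Bob", 40.0), ("Cy", 200.0), ("Dee", 80.0)],
)

# The inner SELECT computes the average order amount; the outer SELECT
# keeps only the orders that exceed it, largest first.
query = """
    SELECT customer, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY amount DESC
"""
above_average = conn.execute(query).fetchall()
print(above_average)

conn.close()
```

The same WHERE ... > (SELECT AVG(...)) pattern should carry over to T-SQL largely unchanged, though data types and setup statements would differ.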

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called Get & Transform Data) with the M language, and Power Pivot with the DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes’ rule, random variables, the central limit theorem, sampling and confidence intervals, and hypothesis testing. Most analysis is conducted using the Analysis ToolPak add-in for Excel.
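Two of the topics listed above, Bayes' rule and confidence intervals, are easy to sketch in a few lines. The example below is only an illustration in Python rather than Excel. The test-accuracy figures and the sample values are invented, and the interval uses the normal (z = 1.96) approximation, which is an assumption that is rough for a sample this small (a t-based interval would be wider).

```python
import math
import statistics


def bayes_posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' rule."""
    # Total probability of a positive test (true positives + false positives).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(sample)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * std_err, mean + z * std_err


# Hypothetical numbers: 1% base rate, 95% sensitivity, 5% false positive rate.
posterior = bayes_posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {posterior:.3f}")

# Hypothetical sample of eight measurements.
sample = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4, 4.1]
low, high = mean_confidence_interval(sample)
print(f"Approximate 95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The Bayes' rule result shows why base rates matter: even with a fairly accurate test, a positive result for a rare condition leaves the posterior probability well below one half.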

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed nine questions on the quizzes (88%) and six on the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade.)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp, which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, the test moves on to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure a beginning programmer’s mastery of the subject. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

DAT204x Score     DAT204x Certificate

DAT203.1x – Data Science Essentials

The first three modules in this course cover statistics and are mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import data using a SQL query, manipulate the data using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.
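The create-a-model, score-it, append-predictions sequence described above can be sketched outside of MAML. The example below is only a conceptual stand-in, not how MAML works internally: it assumes a one-variable least-squares line as the model, and the training pairs and new rows are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data (x, y), invented for this example.
train = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

# "Create a model": estimate the line's parameters from the training data.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# "Score it": apply the model to new rows and append the predictions
# as an extra column, much like adding a prediction column to a worksheet.
new_rows = [{"x": 5.0}, {"x": 6.0}]
scored = [{**row, "prediction": slope * row["x"] + intercept} for row in new_rows]

for row in scored:
    print(row)
```

In MAML the same steps are boxes dragged onto a canvas, and publishing as a web service wraps the scoring step behind an HTTP endpoint; none of that plumbing is shown here.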

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six of the ten courses required for a certificate. I expect to finish the remaining four by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.