MsftBigData

by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist. And which software stack should one specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to do take an online class developed and sponsored by Microsoft and edX. Completion of the course leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about half way done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a t-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced mooc. This course starts with the basics like select, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and on an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called get and transform data) and M language and Power Pivot and DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes rule, random variables, central limit theorem, sampling and confidence interval, and hypothesis testing. Most analysis is conducted using the Data analysis pack add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed 9 questions on the quizzes (88%) and six in the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, it goes to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure subject mastery by a beginning programmer. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

DAT204x Score     DAT204x Certificate

DAT203.1x Data Science Essentials

The first three modules in this course covered statistics and was mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate it using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six out of the ten courses required for a certificate. I expect to finish the remaining 4 needed for a certificate by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.

Advertisements

Dante Chinni and James Gimpel, the authors of Our Patchwork Nation: The Surprising Truth About the “Real” America, look at demographic data of the United States at the county level. There are 3,141 counties in the U.S. and though they vary considerably in geographical size and population, using county-level maps to display data provides a convenient way to compare and contrast demographic data.

By mapping demographic data at the county level, you can see how attributes like population density, income, education, attitudes, behaviors, and health are distributed across the U.S.

Unfortunately, the authors take the wonderfully detailed data available from various sources at the county level and use segmentation analysis to group the counties into twelve categories and give them cute names. Ugh.

OurPatchworkNation

Our Patchwork Nation. Image from Amazon

Luckily, the authors provide access to the raw county-level data at their website, patchworknation.org. You can view county level chloropleth maps for a wide variety of data. There is even a tool to overlay two maps to do comparisons. The tool doesn’t work very well since you cannot select the colors of the overlays. But overall, the patchworknation site has some of the best U.S. data maps available.

An example of the problem using segments rather than the raw data is illustrated in an article that appears in The Atlantic Apr 2010. The map shown below shows the 12 segments. But the user has to flip back and forth between the legend and the map to determine what each color means.

12States

The 12 states of America. Image from PatchworkNation

The raw data, at the Patchwork Nation website is shown in the three maps below. The first map shows the median household income in 1980. The colors show which quintile each county falls into. The second map shows quintiles for 2010. It is hard to compare the two maps to see how the distribution changes. Going to the Patchwork Nation website and toggling between the two maps makes it easier.

12StatesIncome1980

Distribution of median household income by county in 1980. From Patchwork Nation

12StatesIncome2010

Distribution of median household income by county in 2010. From Patchwork Nation

The map below shows the change in median income for each county using 2010 adjusted dollars. Notice the counties with large and growing urban populations, mostly on the east and west coasts. They had the highest median incomes in 1980 and in 2010. They also show the highest growth in median income while the remaining counties show smaller gains or a loss. The gap in income distribution in the U.S. is growing.

For more on this story, read the article at the Patchwork Nation website.

12StatesIncomeGrowth1980-2010

Change in median household income by county from 1980 to 2010. From Patchwork Nation