
by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist? And which software stack should one specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to take an online program developed and sponsored by Microsoft and edX. Completion of the program leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, such as conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced at about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about halfway done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a T-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced MOOC. This course starts with the basics like SELECT statements, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called Get & Transform Data) with the M language, and Power Pivot with the DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics including descriptive statistics, Bayes’ rule, random variables, the central limit theorem, sampling and confidence intervals, and hypothesis testing. Most analysis is conducted using the Analysis ToolPak add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed nine questions on the quizzes (88%) and six on the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, the homework counts very little toward the final grade.)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, the exam moves on to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure subject mastery by a beginning programmer. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

DAT204x Score     DAT204x Certificate

DAT203.1x – Data Science Essentials

The first three modules in this course covered statistics and were mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate it using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six of the ten courses required for the certificate. I expect to finish the remaining four by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.


by George Taniwaki

NASA recently celebrated the fifteenth anniversary of the launching of the Chandra X-ray Observatory by releasing several new images. One of the images, shown below, is an amazing composite that reveals in exquisite detail the turbulence surrounding the remnants of the Tycho supernova. (Scroll down to Figure 2, then come back.)

Tycho supernova

The scientific name of the Tycho supernova remnant is SN 1572, where SN means supernova and 1572 refers to the year it was first observed. That’s right, over 400 years ago, in November 1572, many people noticed a new bright object in the sky near the constellation Cassiopeia. Reports at the time indicated that it was as bright as Venus (peak magnitude of –4) meaning it was visible during the day.

SN 1572 is called the Tycho supernova because a Danish scientist named Tycho Brahe published a paper detailing his observations. His paper is considered one of the most important in the history of astronomy, and science in the Renaissance.

Tycho_Cas_SN1572

Figure 1. Star map drawn by Tycho Brahe showing position of SN 1572 (labelled I) within the constellation Cassiopeia. Image from Wikipedia

What people at the time didn’t know was that SN 1572 was about 9,000 light years away, meaning it was unimaginably far away. The explosion that caused it happened long ago but the light had just reached the earth.

(Actually, SN 1572 is fairly close to us relative to the size of the Milky Way which is 100,000 light years across, and extremely close relative to the size of the observable universe which is 29 billion light years across. Space is just really unimaginably large.)

What they also didn’t know was that SN 1572 was probably a Type Ia supernova. This type of supernova is common and has a very specific cause. It starts with a binary star system in which two stars orbit one another very closely. Over time, one of the stars consumes all of its hydrogen and dies out, leaving a carbon-oxygen core. Its gravity causes it to accrete gas from its companion until its mass reaches what is called the Chandrasekhar limit and it begins to collapse. The increased pressure and temperature cause carbon fusion to start. This results in a runaway reaction, causing the star to explode.

About a supernova remnant

In the 400-plus years since SN 1572 appeared, the debris from the explosion has been flying outward at 5,000 km/s (3,100 mi/s). It is hard to see this debris. Imagine a large explosion on the earth that occurs at night.

The debris itself doesn’t generate very much light, but it does produce some. Space is not a perfect vacuum; it contains a very thin gas. When electrons from the moving debris of the supernova remnant strike a stationary particle, they give off photons (which, depending on the energy of the collision, are seen as radio waves, microwaves, visible light, UV, or x-rays). These collisions also heat up the remaining particles, releasing additional photons and making the debris detectable with a very sensitive telescope.

About false color images

The Chandra X-ray Observatory was launched in 1999 from the space shuttle Columbia. As the name implies, it can take digital images of objects in the x-ray range of light. Since humans cannot see in this range, images taken in the x-ray range are often color coded in the range from green to blue to purple.

Often, composite images of space objects are created using telescopes designed to capture photons at different wavelengths. For instance, visible-light telescopes like the Hubble Space Telescope often have the colors in their images compressed to black and white. Images from infrared telescopes, like the Spitzer Space Telescope, and from ground-based radio telescopes are often given a false color range from red to orange.

Pictures please

All right, finally the results. Below is the most detailed image ever of the Tycho supernova remnant. It is a composite created by layering multiple, long-exposure, high-resolution images from the Chandra X-ray Observatory. The press release says, “The outer shock has produced a rapidly moving shell of extremely high-energy electrons (blue), and the reverse shock has heated the expanding debris to millions of degrees (red and green).

“This composite image of the Tycho supernova remnant combines X-ray and infrared observations obtained with NASA’s Chandra X-ray Observatory and Spitzer Space Telescope, respectively, and the Calar Alto observatory, Spain.

“The explosion has left a blazing hot cloud of expanding debris (green and yellow) visible in X-rays. The location of ultra-energetic electrons in the blast’s outer shock wave can also be seen in X-rays (the circular blue line). Newly synthesized dust in the ejected material and heated pre-existing dust from the area around the supernova radiate at infrared wavelengths of 24 microns (red).”

Tycho2014

Figure 2. Tycho supernova remnant composite image released in 2014. Image from NASA

Compare Figure 2 above to an image of the Tycho supernova remnant, shown below, that NASA released in 2009 using data from observations made in 2003. Notice the lack of detail. Also notice the large number of stars in the background, some even shining through the dust of the explosion. Apparently, the image above has been modified to eliminate most of these distractions.

These two images, dated only a few years apart, reveal what are likely remarkable advances in software for manipulating space images. I say that because the hardware in the telescopes themselves, such as the optics, detectors, and transmitters, probably has not changed much since launch. Thus, any improvement in resolution and contrast between the two images is a result of better capabilities in the software used to process images after the raw data is collected.

A New View of Tycho's Supernova Remnant

Figure 3. Tycho supernova remnant composite image released in 2009. Image from NASA

by George Taniwaki

Your smartphone is more than an addictive toy. With simple modifications, it can become a lifesaving medical device. The phone can already receive and send data to medical sensors and controllers wirelessly. By adding the right software, a smartphone can do a better job than a more expensive standalone hospital-grade machine.

In addition, smartphones are portable and patients can be trained to use them outside a clinical setting. The spread of smartphones has the potential to revolutionize the treatment of chronic conditions like diabetes. This can enhance the quality of life of the patient and significantly increase survival.

Monitoring blood sugar

Type 1 diabetes mellitus is an autoimmune disease in which the body attacks the pancreas and interrupts the production of insulin. Insulin is a hormone that causes the cells in the body to absorb glucose (a type of sugar) from the blood and metabolize it. Blood sugar must be kept within a very tight range to stay healthy.

A lack of insulin after meals can lead to persistent and repeated episodes of high blood sugar, called hyperglycemia. This in turn can lead to complications such as damage to nerves, blood vessels, and organs, including the kidneys. Too much insulin can deplete glucose from the blood, a situation called hypoglycemia. This can cause dizziness, seizures, unconsciousness, cardiac arrhythmias, and even brain damage or death.

Back when I was growing up (the 1970s), patients with type 1 diabetes had to prick their finger several times a day to get a blood sample and determine if their glucose level was too low or too high. If it was too low, they had to eat a snack or meal. (But not one containing sugar.)

They would also test themselves about an hour after each meal. Often, their glucose level was too high, and they had to calculate the correct dose of insulin to self-inject into their abdomen, arm, or leg to reduce it. If they were noncompliant (forgetful, busy, unable to afford the medication, fearful or distrustful of medical institutions or personnel, etc.), they would eventually undergo diabetic ketoacidosis, which often would require a hospital stay to treat.

BloodGlucoseTestStrip

Figure 1a. Example of blood glucose test strip. Photo from Mistry Medical

InsulinShot

Figure 1b. Boy demonstrating how to inject insulin in his leg. Photo from Science Photo Library

If all these needle pricks and shots sound painful and tedious, they were and still are. There are better test devices available now and better insulin injectors, but they still rely on a patient being diligent and awake.

Yes, being awake is a problem. It is not realistic to ask a patient to wake up several times a night to monitor her glucose level and inject herself with insulin. So most patients give themselves an injection just before going to bed and hope they don’t give themselves too much and that it will last all night.

Continuous glucose monitoring

Taking a blood sample seven or eight times a day is a hassle. But even then, it doesn’t provide information about how quickly or how much a patient’s glucose level changes after a meal, after exercise, or while sleeping.

More frequent measurements would be needed to estimate the rate at which a patient’s glucose level would likely rise or fall after a meal, exercise, or sleep. Knowing the rate would allow the patient to inject insulin before the glucose level moved outside the safe range, or to reduce the background dosage if it was too high.

In the 1980s, the first continuous glucose meters were developed to help estimate the correct background dosage of insulin and the correct additional amounts to inject after snacks and meals.

The early devices were bulky and hard to use. They consisted of a sensor that was inserted under the skin (usually in the abdomen) during a doctor visit and had wires that connected it to a monitoring device that the patient carried around her waist. The sensor reported the glucose level every five to ten seconds and the monitor had enough memory to store the average reading every five to ten minutes over the course of a week.

The devices were not very accurate and had to be calibrated using the blood prick method several times a day. The patient would also have to keep a paper diary of the times of meals, medication, snacks, exercise, and sleep. After a week, the patient would return to the doctor to have the sensor removed.

The doctor would then have to interpret the results and calculate an estimated required background dose of insulin during the day and during the night and the correct amount of additional injections after snacks and meals. The patient would repeat the process every year or so to ensure the insulin dosages were keeping the glucose levels within the desired range.

Today, continuous glucose monitors can measure glucose levels using a disposable sensor patch on the skin that stays in place for a week. It transmits data to the monitor wirelessly. Using a keypad, the monitor can also record eating, medication, exercise, and sleeping. The monitor can store months of personal data and calculate the amount of insulin needed in real time. Alerts remind the patient when to inject insulin and how much. The devices are cheap enough and portable enough that the patient never has to stop wearing one.

ContinuousGlucoseMonitor

Figure 2. Wireless continuous blood glucose monitor and display device. Image from Diabetes Healthy Solutions

Continuous insulin pump

Also in the 1980s, the first generation of subcutaneous insulin pumps was commercialized. These pumps could supply a low background dose of insulin rather than the big spikes provided by manual injections. The first pumps were expensive, bulky, and hard to use. By the early 2000s though, insulin pumps became widely available and were shown to reliably reduce the fluctuations in glucose levels seen in patients who relied on manual injections. By providing a low dose of insulin continuously during the day and at night, with the ability for the patient to manually apply larger doses after meals, these pumps lowered the average level of glucose while also reducing the incidence of hypoglycemia. Over longer periods they also reduced the incidence of complications commonly seen with diabetes.

InsulinPumpEarlyinsulinpump

Figure 3a and 3b. Early insulin pump (left) and modern version (right). Images from Medtronic

There is one drawback to the continuous insulin pump. It can provide too much insulin at night while the patient is asleep. While sleeping, the patient’s glucose level falls. Since she is not performing blood tests, she will not notice that the insulin pump is set too high. Further, since she is asleep, she may not realize that she is in danger from low blood sugar, a condition called nocturnal hypoglycemia.

Software to control the pump

Imagine combining the continuous glucose meter with the continuous insulin pump. Now you have a system that mimics the behavior of the human pancreas. Sensors constantly monitor the patient’s glucose level and anticipate changes caused by activities like eating, sleeping, and exercise.

The key is a well-written algorithm that predicts the amount of insulin the pump needs to inject to keep sugar levels within the acceptable range. Instead of a human, software controls the insulin pump. If the glucose level does not stay within the desired range, the algorithm learns from its mistake and corrects it.
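As an illustration only (this is not the Medtronic or Massachusetts General Hospital algorithm, just a toy proportional controller to show the shape of a closed feedback loop, with made-up target, gain, and basal values):

```python
# Toy feedback loop, for illustration only. A real artificial pancreas uses a far
# more sophisticated, clinically validated dosing algorithm.
TARGET = 100    # desired glucose level, mg/dl (an assumed value)
GAIN = 0.01     # correction aggressiveness, insulin units per hour per mg/dl
BASAL = 1.0     # background insulin rate, units per hour

def insulin_rate(glucose_mg_dl):
    """Adjust the pump rate based on the latest continuous glucose monitor reading."""
    error = glucose_mg_dl - TARGET
    return max(BASAL + GAIN * error, 0.0)   # never command a negative rate

for reading in (180, 140, 100, 70):
    print(reading, "mg/dl ->", round(insulin_rate(reading), 2), "units/hour")
```

A reading above the target nudges the rate up, a reading below nudges it down or suspends delivery, which is the basic loop the real systems refine with prediction and learning.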

The initial goal of the combined monitor and pump was to predict low glucose levels while a patient was sleeping and suspend the pumping of insulin to prevent nocturnal hypoglycemia. Ironically, the US FDA panel rejected the first application submitted for the device, saying that the traditional uncontrolled continuous insulin pump was actually safer than the new device because of the new device’s lack of field experience.

After years of additional studies, the combined device, manufactured by Medtronic, was approved for use in the US in 2013. Results of a study involving 25 patients in the UK were published in Lancet Jun 2014. Another trial, involving 95 patients in Australia, was published in J. Amer. Med. Assoc. Sept 2013.

CombinedGlucoseMonitorInsulinPump

Figure 4. Combined glucose meter and insulin pump form a bionic pancreas. Image from Medtronic

Better software and smartphones

The Medtronic combined device is proprietary. But several groups are hacking it to make improvements. For instance, researchers led by Z. Mahmoudi and M. Jensen at Aalborg University in Denmark have published several papers (Diabetes Techn Ther Jun 2014, Diabetes Sci Techn Apr 2014, Diabetes Techn Ther Oct 2013) on control algorithms that may be superior to the one currently used in the Medtronic device.

Another interesting paper appeared in the New Engl J Med Jun 2014. It reports a study by Dr. Steven Russell of Massachusetts General Hospital and his colleagues. They wrote an app for a smartphone (Apple’s iPhone 4S) that could receive the wireless data from the Medtronic glucose meter and wirelessly control the Medtronic insulin pump.

Smartphones are ideal platforms for use in developing medical devices because they can communicate wirelessly with other devices, have sufficient computing power and memory for even the most complex control tasks, are designed to be easy to program and easy to use, and many people already own one.

Dr. Russell and his colleagues used a machine learning algorithm they had previously developed (J Clin Endocrinol Metab May 2014) to couple the two.

Even though this is a research project, not a commercial product, the results are pretty impressive. The study lasted 5 days, with the first day used to calibrate the algorithm and days 2-5 as the test.

As can be seen in Figure 5, after a day of “training” patients using the bionic pancreas (solid black line) had lower average glucose levels than patients on the standard protocol (solid red line). Further, the variance of their glucose level (black shaded area) was smaller than for patients on the standard protocol (red shaded area). Notice how much better the control is using the bionic pancreas, especially at night.

InsulinNEJM1

Figure 5. Variation in mean glucose level among adults during 5-day study. Image from New Engl J Med

Another measure of quality is the amount of time the patients’ glucose levels were within the desired range of 70 to 120 mg/dl (the green shaded region in Figure 6). Patients with the bionic pancreas (solid black line) spent about 55% of the time within the desired range. They also had fewer incidents of hypoglycemia (pink shaded region) or hyperglycemia (white region on right) than patients using the standard protocol (red line).

Note that even with the bionic pancreas, 15% of the time patients had a glucose level above 180 mg/dl, so there is still plenty of room to improve control.

InsulinNEJM

Figure 6. Cumulative glucose level in adults during day 1 where the bionic pancreas adapted to the patient (dashed line) and days 2-5 (solid black). Image from New Engl J Med

by George Taniwaki

If you have ever taken a picture through an exterior window, you may have been disappointed with the results. This can happen whether you are standing inside a building shooting out or standing on the street shooting in. That’s because if the side you are standing on is brighter than the other side, you will get a reflection of the lights off the glass. And if you are standing in the light, you will also get a reflection of yourself, which just looks goofy.

Another problem is that if it is a sunny day, it will be much brighter outside than inside, so all of the objects inside the building will be too dark relative to the objects outside.

Photographer Nick Kelsh has a blog post where he gives a good description of the problem and some tips to avoid it. Though his recommendation to throw a rock through the window may be a bit extreme.

You can try to fix the problem using Photoshop, but it is a lot of work because you will need to fix the color balance for two different parts of the scene. The interior lighting will probably be an artificial source like incandescent (2700K) or fluorescent (5000K) while the exterior will be lit by the sun (~10,000K). Removing the reflections will be harder still.

A better solution, if you can control the interior lights, is to take two pictures and combine them. The first picture is taken with the lights on and is designed to capture the object inside the building. The second picture is taken with the lights off and is designed to capture the view outside the window. (This assumes the lights are inside.)

Then you merge the two images. If the window is rectangular, you don’t even need fancy equipment or software to perform this trick. In the example I show below, I didn’t use a tripod, just a handheld smartphone. To manipulate the images, I used the free Google Picasa to adjust the color balance and brightness of the two images. I copied and pasted the image within the window using the free Microsoft Paint.

Take two pictures

The first step is to take two pictures out of the window without moving the camera, one with the lights on and one with the lights off. For this example, I am standing inside the house. When taking the picture with the lights on, I set my focus and exposure to capture the most prominent object(s) inside the house. I ignore the reflections in the window (Fig 1).

Conversely, when taking the picture with the lights off, I set my focus and exposure to capture the view though the window. I ignore the underexposure and poor lighting of objects inside the house (Fig 2).

LightsOnLightsOff

Figures 1 and 2. Image with interior lights on (left) and lights off (right)

For best results, you will want to mount the camera on a tripod to eliminate any movement between the two images and any blur caused by long exposure during the lights off photo. However, if the window is rectangular and there is no overlap between objects in the foreground and the window, it doesn’t really matter. In fact, the images don’t even have to be out the same window.

Adjust the brightness and color balance separately

Open each image individually in your favorite photo editor software package. I use Google Picasa because it is free. Set the color balance, brightness, and contrast. I dislike the yellowish tone of photographs taken with incandescent lights and always try to correct them to fluorescent. When taking photos with a low quality camera, such as most smartphones, also apply despeckle and unsharp masking. Save both images.

Copy and paste

You can use Google Picasa to crop an image and copy it to the clipboard. However, it doesn’t appear to have a way to paste the clipboard into an image. Thus, we will merge the two images using the free Microsoft Paint. Open the image that has the good view through the window. Create a selection rectangle around the area within the window frame and copy it to the clipboard. Then open the image that correctly exposes the objects inside the room and paste. If the images don’t overlap properly because they are different sizes, start over and resize the first image to match the second. Save under a new name (so you don’t ruin the original image).
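If you prefer to script the merge instead of using Paint, here is a minimal sketch using the Pillow library in Python; the file names and window rectangle are placeholders you would replace with your own values:

```python
# Merge the window region from the "lights off" photo into the "lights on" photo.
# File names and the window rectangle below are placeholders for illustration.
from PIL import Image

lights_on = Image.open("lights_on.jpg")    # exposed for the interior objects
lights_off = Image.open("lights_off.jpg")  # exposed for the view out the window

# The two photos must be the same size; resize if they are not.
if lights_off.size != lights_on.size:
    lights_off = lights_off.resize(lights_on.size)

# Rectangle around the window pane: (left, upper, right, lower) in pixels.
window_box = (850, 300, 1700, 1050)

window_view = lights_off.crop(window_box)
lights_on.paste(window_view, window_box[:2])  # paste at the same upper-left corner
lights_on.save("merged.jpg")
```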

The resulting image (Fig 3) looks pretty good, but it isn’t perfect. First, because I was using a smartphone, the focal length of the lens is very short so the sides of the windows are curved rather than straight. Since I used a rectangular selection to cut and paste, parts of the original window image show through. For instance, you can see the reflection of a light peeking through in the upper right corner of the window.

Second, the gooseneck of the sink faucet is within the rectangular selection so it is too dark.

Finally, since I didn’t use a tripod, there is some movement and the gooseneck of the sink faucet is slightly displaced.

KItchenWindows_13

Figure 3. The merged image, color corrected and cropped

Using Photoshop or other advanced image editor, you can make a better selection. And you don’t have to use the image out the window. Once you have created a mask, you can paste any image into the scene.

All photographs by George Taniwaki

A common task that webmasters are asked to perform is to get bids for hosting a website. When gathering information to prepare a quote, the vendor will often ask what the peak load (in server requests per second) will be. As a webmaster you may well ask, how the heck do I do that?

Estimating page views per month

Estimating peak server requests per second is a four step process. First, we must estimate page views per month. Next, we estimate average page views per second during the heaviest or prime viewing period. Then we estimate peak page views per second during the prime viewing period. Finally, we estimate peak server requests per second during the prime viewing period.

Average page views per month can be obtained by looking at server logs. If server logs are not available or you are creating a new site, then it can be estimated using logs from similar sites. For this blog post, we will assume the website generates an estimated 2.6 million page views per month. (By comparison, this blog generates about 2,000 page views per month, mostly from bots, I think.)

Estimating average page views per prime viewing second

Assume your website, which generates 2.6 million page views per month, has traffic that is fairly steady all day and all night, every day. That is, all of the roughly 730 hours in a month (= 365 days per year / 12 months per year * 24 hours per day) are prime viewing hours. In that case, we can calculate the mean page views per second with some simple arithmetic.

page views per second = 2.6 million page views per month / 730 hours per month / 3600 seconds per hour = approx. 1 page view per second

But what if traffic to the website isn’t steady? What if people only visit it during work hours? Well, there are about 168 work hours per month compared to about 730 actual hours per month, a ratio of about 4.3 to 1. So during prime viewing hours there will be about 4.3 page views per second and 0 page views per second during non-work hours. (This assumes everyone works the same days and same hours regardless of time zone.)

The prime viewing hours for a website can be even more compressed. Let’s say you run a website for NBC and it has a blog that contains a synopsis of the television show Grimm, with an update posted immediately after each new episode airs. In that case, perhaps all of the page views will occur during a four-hour period starting at 10:00 pm every Friday. Thus, there will be 16 prime viewing hours per month during which there will be about 45 page views per second, and 0 page views per second during the rest of the month.
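Here is a minimal Python sketch of the arithmetic for all three traffic patterns (the monthly total and prime-hour counts are the ones assumed above):

```python
# Average page views per second for three traffic patterns, all of which
# generate 2.6 million page views per month.
MONTHLY_PAGE_VIEWS = 2_600_000
SECONDS_PER_HOUR = 3600

prime_hours_per_month = {
    "steady around the clock": 730,            # 365 / 12 * 24
    "work hours only": 168,
    "4 hours after each weekly episode": 16,   # 4 hours x 4 episodes
}

for pattern, hours in prime_hours_per_month.items():
    rate = MONTHLY_PAGE_VIEWS / (hours * SECONDS_PER_HOUR)
    print(f"{pattern}: about {rate:.1f} page views per second during prime hours")
```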

The chart below shows the page view distribution for the three cases described above. This model is quite simplified. It can obviously be made more complex by assuming that the prime viewing hours depend on time zone, that page views do not drop to zero during the non-prime viewing hours, or that multiple variables affect page views during a particular hour.

PrimeViewing

Three ways to achieve 2.6 million page views per month. Image by George Taniwaki

Let’s look at the distribution of page views in more detail. In the four-hour prime viewing period case we said that there were 0 page views per second before and after the prime viewing period and an average of 45 page views per second during the prime viewing period. If the number of page views is constant throughout the prime viewing period, then the distribution curve is rectangular as shown by the blue line in the chart below.

But it is unlikely that the change in page view rate is so abrupt. It is more likely that page views rise steadily to a peak and then fall. If the distribution is triangular and spread across four hours, then the average page views at the maximum point will be 90 per second (=45*2) as shown by the brown line below. If the distribution is bell shaped, called the normal distribution, then the average peak page views at the maximum point will be somewhere in between as shown by the green line below.

ProbDist

Three ways to achieve 625,000 page views in an evening. Image by George Taniwaki

One caveat: I tried to draw the curves in the chart above so that all three would have similar variance but didn’t actually do the calculations to verify it.

Estimating peak page views per prime viewing second

All of the work above was to find the average number of page views per second during the prime viewing time. However, visitors to the website will arrive randomly. So we can expect that there will be some fluctuation in the number of page views during a second. Some seconds during the prime viewing time will have fewer than the average number of visitors and some seconds will have more. We can model this random arrival of visitors using the Poisson distribution.

Since the arrival of visitors will be random, we cannot put a hard ceiling on the number of visitors the website will ever receive in a second; that number is unbounded. But we can estimate it for a variety of confidence levels, such as 90%, 99%, and even 99.999% (the so-called five 9s availability level). In this case the confidence level indicates the proportion of one-second intervals that will be at or below the peak.

Using Excel’s Poisson distribution function, we can estimate the ratio of peak page views per second to average page views per second at various confidence levels. The results are shown in the three tables below. Notice that although the average page views per second can be a fraction, the peak page views per second is always an integer.

Avg. page views per month | Avg. page views per second at max. point* | Peak page views per second at 0.9 confidence level | Ratio of peak to average at max. point
1 | 0.00000165 | 0 or 1 | 0 or 604,800
1,000 | 0.00165 | 0 or 1 | 0 or 605
1,000,000 | 1.65 | 3 | 1.81
2,600,000 | 4.30 | 7 | 1.63
1,000,000,000 | 1653 | 1706 | 1.03

Avg. page views per month | Avg. page views per second at max. point* | Peak page views per second at 0.99 confidence level | Ratio of peak to average at max. point
1 | 0.00000165 | 0 or 1 | 0 or 604,800
1,000 | 0.00165 | 0 or 1 | 0 or 605
1,000,000 | 1.65 | 5 | 3.0
2,600,000 | 4.30 | 10 | 2.3
1,000,000,000 | 1653 | 1749 | 1.06

Avg. page views per month | Avg. page views per second at max. point* | Peak page views per second at 0.99999 confidence level | Ratio of peak to average at max. point
1 | 0.00000165 | 0 or 1 | 0 or 604,800
1,000 | 0.00165 | 1 | 605
1,000,000 | 1.65 | 9 | 5.4
2,600,000 | 4.30 | 16 | 3.7
1,000,000,000 | 1653 | 1830 | 1.11

*Assumes 168 prime viewing hours per month with uniform distribution
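The tables above were built with Excel’s Poisson function. As a cross-check, here is a hedged sketch that reproduces the peak and ratio columns in Python, using scipy’s Poisson quantile as a stand-in for Excel (an assumption; the original calculation was done in a spreadsheet):

```python
# Reproduce the peak page views per second and peak-to-average ratios above.
# scipy's poisson.ppf is used here as an equivalent substitute for Excel's
# POISSON function.
from scipy.stats import poisson

PRIME_SECONDS = 168 * 3600   # 168 prime viewing hours per month, uniform traffic

for confidence in (0.9, 0.99, 0.99999):
    print(f"Confidence level {confidence}:")
    for monthly_views in (1, 1_000, 1_000_000, 2_600_000, 1_000_000_000):
        avg = monthly_views / PRIME_SECONDS          # average page views per second
        peak = int(poisson.ppf(confidence, avg))     # smallest k with P(X <= k) >= confidence
        ratio = peak / avg if peak else 0.0
        print(f"  {monthly_views:>13,} views/month: avg {avg:.6f}/s, "
              f"peak {peak}/s, ratio {ratio:,.2f}")
```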

Also notice that when the average page views per second is low, the peak page views per second can have two solutions, 0 or 1. These cases occur when the average page views per second is below 1 minus the confidence level. For instance, if all you care about is that your web server can handle all the traffic 99% of the time, and your average traffic is less than 0.01 page views per second, you don’t need a web server at all! That’s because 99% of the time (during the prime viewing period), there is no traffic to your website.

However, if your goal is to be able to serve 99% of your visitors during the peak viewing time, then you need a web server that can deliver at least one page view per second. And if your average traffic is only one page view every 100 seconds, your ratio of peak to average will be 100. Providing a complete web server ready to serve the rare visitor results in tremendous overhead costs, which is why cloud computing, where resources are shared among many websites, is becoming so popular.

Finally, notice that when the average page views per second is high (say 1,000), the ratio of peak to average is close to 1 and does not grow very much even at high availability (or confidence) levels. At high levels of average page views, the error in estimating the average number of page views per second is likely to be much greater than the error introduced by ignoring the random distribution of page views per second.

Estimating peak server requests per prime viewing second

Our last step is to estimate the number of server calls generated by a single web page request from a user. A typical web page consists of a static HTML file plus one or more images, videos, ads, and JavaScript widgets displayed on the page. (In the case of dynamic pages, the content will be generated on the server, usually as a JSP file, or an ASPX file if you are using the Microsoft .NET Framework.) If the page is not cached, sending all of the page contents may take over 100 requests to the server.

Assume your website consists of a single page that contains 100 items and all of those items reside on a single server. Now assume a single user calls for that page and you don’t want the user to have to wait more than one second before being able to interact with any part of it. That means the web server will need to be able to handle at least 100 requests per second per page view. (There are other potential bottlenecks in rendering the web page including Internet traffic, ISP speed, and the speed of the client computer, but we’ll ignore these for purposes of this blog post.)

Using the five nines confidence level, the final results for page views and server calls are shown in the table below. For our website with an expected 2.6 million page views per month, we need a web server that can handle 1,600 requests per second.

Avg. page views per month | Avg. server requests per month* | Peak page views per second at 0.99999 confidence level** | Peak server requests per second at 0.99999 confidence level*,**
1 | 100 | 1*** | 100***
1,000 | 100,000 | 1 | 100
1,000,000 | 100,000,000 | 9 | 900
2,600,000 | 260,000,000 | 16 | 1,600
1,000,000,000 | 100,000,000,000 | 1830 | 183,000

*Assumes 100 server calls per page view
**Assumes 168 prime viewing hours per month with uniform distribution
***Assumes goal is to satisfy 99.999% of visitor requests, not 99.999% of time
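Putting the whole estimate together, a short hedged sketch (again substituting scipy’s Poisson quantile for Excel, with the 100 server calls per page view assumed above):

```python
# Convert peak page views per second into peak server requests per second,
# assuming 100 server calls per page view and 168 prime viewing hours per month.
from scipy.stats import poisson

CALLS_PER_PAGE_VIEW = 100
PRIME_SECONDS = 168 * 3600

def peak_server_requests(monthly_page_views, confidence=0.99999):
    avg_rate = monthly_page_views / PRIME_SECONDS
    # Serve at least one page view per second so the rare visitor is not turned away.
    peak_views = max(int(poisson.ppf(confidence, avg_rate)), 1)
    return peak_views * CALLS_PER_PAGE_VIEW

print(peak_server_requests(2_600_000))   # about 1,600 requests per second
```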

*****

Update: If your website is one of many hosted on a single server, then you should skip the calculations for estimating peak page views per prime viewing second. That’s because traffic to your site will be combined with traffic to other sites. In that case, it is up to the company hosting the sites to combine the average traffic from all the sites first, then calculate the peak page views based on their promised availability.

Nearly every state in the U.S. maintains a registry of people willing to become deceased organ donors. The intent of an individual to be a donor is stored as a Boolean value (meaning only yes or no responses are allowed) within the driver’s license database. Nearly all states use what is called an opt-in registration process. That is, the states start with the assumption that drivers do not want to participate in the registry (default=no) and require them to declare their desire (called explicit consent) to be a member of the registry either in-person, via a website, or in writing.

One of the frequent proposals to increase the number of deceased organ donors is to switch the registration of donors from an opt-in system to an opt-out system. In an opt-out system, all drivers are presumed to want to participate (default=yes) and people who do not wish to participate must state their desire not to be listed.

Let’s look at the logical and ethical issues this change would present.

Not just a framing problem

Several well-known behavioral economists have stated that switching from opt-in to opt-out is simply a framing problem. For instance, see chapter 11 of Richard Thaler and Cass Sunstein’s book Nudge and a TED 2008 talk by Dan Ariely using data from papers by his colleagues Eric Johnson et al., in Transpl. Dec 2004 and Science Nov 2003 (subscription required).

The basic argument is that deciding whether to donate organs upon death is cognitively complex and emotionally difficult. When asked to choose between difficult options, most people will just take the default option. In the case of an opt-in donor registration, this means they will not be on the organ donor registry. By switching to an opt-out process, the default becomes being a donor. Thus, any person who refuses to make an active decision will automatically become a registered organ donor (this is called presumed consent). This will increase the number of people in the donor registry without causing undue hardship since drivers can easily state a preference when obtaining a driver’s license.

However, these authors overlook two important practical factors. First, switching from opt-in to opt-out doesn’t just reframe the decision the driver must make between two options. It will actually recategorize some drivers.

Second, it changes the certainty of the decision of those included in the organ registry, which affects the interaction between the organ recovery coordinators at the organ procurement organization (OPO) and the family member of a deceased patient.

There are more than two states for drivers regarding their decision to donate

Note that the status of a driver’s intent to be an organ donor is not just a simple two-state Boolean value (yes, no). There are actually at least three separate variables related to the intention to be an organ donor. First, upon the driver’s death, if no other family members would be affected, would she like to be an organ donor (yes, no, undecided)? Second, has she expressed her decision to the DMV and had it recorded (yes, no)? Finally, would she like her family to be able to override her decision (yes, no, undecided)? The table below shows the various combinations of these variables.

Category | Driver would like to be organ donor | Driver tells DMV of decision | Driver would permit family to override decision | Comment
1a | Yes | Yes | No | Strong desire
1b | Yes | Yes | Yes or Undecided | Weak desire
2a | No | Yes | Yes or Undecided | Weak reject
2b | No | Yes | No | Strong reject
3a | Yes | No | Yes, No, or Undecided | Unrecorded desire
3b | No | No | Yes, No, or Undecided | Unrecorded reject
4 | Undecided | Yes or No | Yes* | Undecided

*No or Undecided options make no sense in this context

Opt-in incorrectly excludes some drivers from the donor registry

Now let’s sort these people into two groups, one that we will call the organ donor registry and another for drivers not on the registry.

Under the opt-in process, only drivers in categories 1a and 1b are listed on the organ registry. These drivers have given explicit consent to being on the registry. Drivers in categories 2a, 2b, 3a, 3b, and 4 are excluded from the registry. Thus, we can be quite certain that everyone on the registry wants to be a donor. (There is always a small possibility that the driver accidentally selected the wrong box, changed their mind between the time they obtained their driver’s license and the time of death, or a computer error occurred.)

In most states the drivers not on the organ registry are treated as if they have not decided (i.e., as if they were in the fourth category). When drivers not on the registry die under conditions where the organs can be recovered, the families are asked to decide on behalf of the deceased.

Under an opt-in process, drivers in category 2b are miscategorized. They don’t want to be donors and didn’t want their family to override that decision, but the family is still allowed to decide. The drivers in categories 3a and 3b are miscategorized as well. The ones who don’t want to be donors (3b) are also forced to allow their families to decide. The ones who want to be donors (3a) are now left to let their families decide.

Opt-out incorrectly includes some drivers in the donor registry

Under an opt-out process, drivers in categories 1a, 1b, 3a, 3b, and 4 are grouped together and placed on the organ registry. If the donor registry is binding and the family is not allowed to stop the donation, then the process is called presumed consent. (Note that many authors use opt-out and presumed consent interchangeably. However, they are distinct ideas. Opt-out is a mechanical process of deciding which driver names are added to the registry. Presumed consent is a legal condition that avoids the need to ask the family for permission to recover the organs.)

Drivers in category 3a who wanted to be registered are now correctly placed on the registry. But any drivers in category 3b who don’t want to be on the registry are now assumed to want to be donors, a completely incorrect categorization. Similarly, all drivers in the fourth category who were undecided are now members of the definite donor group and the family no longer has a say.

Only drivers in category 2a and 2b are excluded from the registry. We can be quite certain these people do not want to be donors. But some (category 2a) were willing to let the family decide. Now they are combined with the group of drivers who explicitly do not want to donate.
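The sorting described above can be written as a small rule. Here is an illustrative Python sketch using the category labels from the table (it models only the registry logic discussed in this post, not any state’s actual system):

```python
# Which categories end up on the donor registry under each process.
# Category labels follow the table above; this is an illustration only.
CATEGORIES = ("1a", "1b", "2a", "2b", "3a", "3b", "4")

def on_registry(category, process):
    told_dmv_yes = category in ("1a", "1b")   # driver explicitly consented
    told_dmv_no = category in ("2a", "2b")    # driver explicitly declined
    if process == "opt-in":
        return told_dmv_yes                   # only explicit consent is listed
    if process == "opt-out":
        return not told_dmv_no                # everyone except explicit refusals is listed
    raise ValueError(f"unknown process: {process}")

for process in ("opt-in", "opt-out"):
    listed = [c for c in CATEGORIES if on_registry(c, process)]
    print(f"{process} registry: {listed}")
```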

The distribution of categories into the registry under the opt-in and opt-out process and how they are treated are shown in the table below.


Process | Categories added to donor registry | Categories not added to donor registry | Implications
Opt-in | 1a, 1b, both treated as if in category 1a (explicit consent) | 2a, 2b, 3a, 3b, 4, all treated as if in category 4 (family choice) | Drivers in the registry are nearly certain to want to be donors. The actual desire of drivers not on the registry is ambiguous
Opt-out | 1a, 1b, 3a, 3b, 4, all treated as if in category 1a (presumed consent) or 1b (family choice) | 2a, 2b, both treated as if in category 2b (explicit reject) | Drivers not in the registry are nearly certain to not want to be donors. The actual desire of drivers on the registry is ambiguous

Ethical implications of misclassification

If there are no drivers in categories 3a, 3b, and 4, then switching from opt-in to opt-out will have no impact on the size of the donor registry. However, if there are any drivers in these categories, then some will be incorrectly categorized regardless of whether opt-in or opt-out is used. This miscategorization will lead to some ethical problems.

Under opt-in, there may exist cases where the driver has made a decision to donate (category 3a) or not (categories 2a or 3b) but family members overrule it. These errors are hard to avoid because they are caused by the lack of agreement between the driver and other family members.

However, under opt-out combined with presumed consent, there may exist cases where neither the driver (category 3b) nor the family wants to donate, but they cannot stop it. Similarly, the driver may want to let the family choose whether to donate (category 4) and the family does not want to donate, but they cannot stop it.

It appears that from an ethical perspective, opt-in is less likely to create a situation where respect for an individual’s right to make decisions about how her body should be treated is denied. For further discussion of the ethical issues, see J. Med. Ethics Jun 2011 and J. Med. Ethics Oct 2011 (subscription required).

Next we will look at the impact switching from opt-in to opt-out will have on the interaction between the organ recovery coordinator and the family. See Part 2 here.

[Update: This blog post was significantly modified to clarify the “decision framing” issue.]

by George Taniwaki

In a May 2010 blog post, I featured a new social media website called PatientsLikeMe, which allows patients to compare medical notes with each other. It also allows patients to search for each other and provides a forum where they can create a sense of community.

In February, PatientsLikeMe upgraded its site and added a bunch of new features, including automatically adding patients to forums with other patients who have similar conditions. They’ve also made it easier to measure and track how you are feeling and share that information with others.

If you are a kidney patient, especially a transplant recipient, I encourage you to check out the site.

PatientsLikeMeConditions

New condition tracking page. Image from PatientsLikeMe

****

One thing that PatientsLikeMe is missing is the ability to communicate with other patients in real-time, using chat. To address this need for instant feedback from fellow patients there is a site called HealCam.

HealCam was developed by a California anesthesiologist named Michael Ostrovsky. Dr. Ostrovsky modeled HealCam on ChatRoulette. The difference is that ChatRoulette seems to be a rather pointless game while HealCam is intended to be a tool to help both participants cope with their condition.

One problem with HealCam is that even though it has been around for almost a year, very few people have heard of it. For instance, when I signed on, I was the only participant. Unless HealCam can increase usage, it will languish. Perhaps it should not be a standalone operation but instead become a feature within a site such as PatientsLikeMe.

HealCam

The HealCam user interface is clean and simple. Image from HealCam