The Sky is BLUE (Best Linear Unbiased Estimator): The Beginning of an Endless Journey Towards Data Science

Kowshik Sarker
Jun 29, 2021
Image Source: https://www.worldatlas.com/r/w960-q80/upload/56/c5/7c/shutterstock-520792630.jpg

An ever-wondering question from our childhood: “Why is the sky blue?” The quest begins there. After a long journey of understanding the fundamental science of how our eyes and brain perceive each color in the scattered rays of sunlight, we understood that the reason behind this endless blue is hidden in a beautiful natural process of which we human beings are also a part. It starts from the Sun: rays of white light come down into our atmosphere. They hit tiny particles in the atmosphere (mostly gas molecules, far smaller than the wavelength of visible light), which scatter the short blue wavelengths much more strongly than the long red ones. That scattered blue light reaches our eyes from every direction, and thus we recognize…. WoW!! the sky is blue.

Now, why is this phenomenon so relevant to today’s buzzword, “DATA SCIENCE”? Seen side by side, data science begins with the question “What?” and moves towards the answer to “Why?”. Here, too, the “What?” is “The sky is blue” and the “Why?” is “Why is the sky blue?”. The answer itself gives you the architecture of the quest.

The “Sun” is the source, just as data is the source in any data science quest. The “Atmosphere” is the model, and the “Particles” are the parameters or independent variables in the model that are responsible for the outcome. “Blue” is the outcome, and thus the dependent variable, which we enjoy through our eyes with eternal bliss.

So, in this context, why is BLUE so significant? To me, any quest begins within yourself, and that applies to data science too. Example time!! Suppose you have observed over a long period that your savings are declining day by day, even though you are earning a good amount of money (“Good” is an assumption here!! :P :P). Then why are the savings shrinking? Surely because of increasing expenditure. But wait, there are lots of expenses in our daily life…. some are systematic and some are non-systematic or ad hoc. So which expenses are most likely to be driving the increase? That is the point where you start to jot down the relationship between each of these items: Income -> Expenditure -> Savings. Here comes the most fundamental concept of data-driven decision making, i.e. REGRESSION, or OLS (ordinary least squares) regression. This powerful tool lets you quantify the relationship between each continuous component (a causal reading is justified only when the model is well specified and its assumptions hold) and gives you a beautiful equation/model whose coefficient estimates are BLUE (Best Linear Unbiased Estimators) under the classical assumptions.
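To make the “unbiased” part of BLUE a little more concrete, here is a minimal simulation sketch using NumPy. Everything in it is an assumption made for illustration: the “true” savings equation, its coefficients, and the noise level are invented, not taken from real data. It fits OLS on many simulated samples and checks that the estimates average out close to the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" model, invented for illustration only:
# savings = 5 + 0.3 * income - 0.8 * expenditure + noise
true_beta = np.array([5.0, 0.3, -0.8])
n_obs, n_sims = 200, 2000

# Regressors are drawn once and held fixed across the simulated samples.
income = rng.normal(50, 10, n_obs)
expenditure = rng.normal(30, 5, n_obs)
X = np.column_stack([np.ones(n_obs), income, expenditure])

estimates = np.empty((n_sims, 3))
for i in range(n_sims):
    noise = rng.normal(0, 2, n_obs)  # zero-mean, constant-variance errors
    savings = X @ true_beta + noise
    beta_hat, *_ = np.linalg.lstsq(X, savings, rcond=None)  # OLS fit
    estimates[i] = beta_hat

# Average of the OLS estimates over all simulated samples.
mean_estimate = estimates.mean(axis=0)
print("true coefficients:   ", true_beta)
print("average OLS estimate:", mean_estimate.round(3))
```

Any single estimate misses the truth by a little, but the average lands close to it, which is what unbiasedness means. The “best” part (smallest variance among linear unbiased estimators) is a separate property that follows from the Gauss-Markov theorem when the classical assumptions hold.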

Image Source : https://datascience.foundation/img/pdf_images/understanding_of_linear_regression_with_python_1.png

First, it is important to understand the concept of linear regression before blindly applying the algorithm anywhere (I would rather call it one of the most fundamental mathematical models in existence). One has to understand the six classical assumptions of OLS regression, which form the base of any analytical framework. Now here is the catch: there are two schools of thought in the world of data science.

1. The best/most accurate prediction is of utmost importance.

2. A good prediction with sound logic behind it is of utmost importance.

Case 1 is for most ML engineers and scientists. Case 2 is for classical statisticians and econometricians. But in my view, a good data scientist should know both: in one, the train-test split drives the quest; in the other, the causal impact of each variable, judged by statistical significance, explains the stochastic outcome. Either way there will be error. In case 1 the ultimate goal is to minimize the prediction error as much as possible; in case 2 error minimization is not the ultimate goal, which is instead to find the best possible independent variables with good estimators. This is what differentiates one from the other. But we shouldn’t be biased and lean towards one method. We should pick the approach depending on the scenario.
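The two schools can be put side by side in a small sketch. This is only an illustration under stated assumptions: the data are synthetic, the 80/20 split and the variable names are arbitrary choices, and plain NumPy is used so that the mechanics stay visible.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic data, invented for illustration only.
income = rng.normal(50, 10, n)
expenditure = rng.normal(30, 5, n)
savings = 5 + 0.3 * income - 0.8 * expenditure + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), income, expenditure])

# School 1: prediction. Fit on a training split, judge by held-out error.
idx = rng.permutation(n)
train, test = idx[:240], idx[240:]
beta_train, *_ = np.linalg.lstsq(X[train], savings[train], rcond=None)
test_mse = np.mean((savings[test] - X[test] @ beta_train) ** 2)
print("held-out MSE:", round(test_mse, 3))

# School 2: inference. Fit on all data, judge each coefficient by its t-statistic.
beta_full, *_ = np.linalg.lstsq(X, savings, rcond=None)
resid = savings - X @ beta_full
dof = n - X.shape[1]                      # residual degrees of freedom
sigma2 = resid @ resid / dof              # estimated error variance
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_full / std_err
for name, b, t in zip(["const", "income", "expenditure"], beta_full, t_stats):
    print(f"{name:12s} coef={b:7.3f}  t={t:7.2f}")
```

The first half only asks “how far off are my predictions on unseen data?”. The second half asks “which variables matter, and how sure am I about each coefficient?”. Same model, different questions.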

Here is a simple linear regression example I am presenting in Python using Statsmodels. I hope you will enjoy it and will be able to figure out when to use which technique. Sometimes, I believe, it comes from intuition. I sometimes lean a bit towards the econometric method, as it gives me freedom and control over the variables for statistical analysis and inference, which I miss in machine learning techniques, i.e. black-box techniques. But we should be rational when choosing a technique, and above all we must have a clear understanding of the fundamental concept behind every model and the scenario we are solving.

To end, here is a charismatic quote for you to think about!!

“Regression is like woman. You never understand it fully. When you think you do, the next model is entirely different.” — Dakshitha Wickramasinghe.

Happy Learning!! 😊
