Why The Central Limit Theorem in Data Science?
Today I’ll be discussing what the central limit theorem (or CLT) is and why it is important for every data science enthusiast to know.
Aditi Mittal
Apr 28, 2020·4 min read
Formal Definition
The central limit theorem states that, for a population with an unknown distribution (provided its variance is finite), the distribution of the sample means approximates a normal distribution once the samples are large enough.
In other words, if we repeatedly draw samples and compute the mean of each one, the distribution of those means will approximate a Gaussian distribution more and more closely as the sample size increases. For the approximation to be useful, each sample must be sufficiently large.
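To make this concrete, here is a minimal simulation sketch using only the Python standard library. The exponential population, the sample sizes (2 and 50), the number of samples (5000) and the helper names are my own illustrative assumptions, not part of the theorem. The sketch draws many samples from a strongly right-skewed population and reports the mean and skewness of the sample means for each sample size.

```python
import random
import statistics


def sample_means(draw, sample_size, num_samples, seed=0):
    """Draw `num_samples` samples of `sample_size` values each and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def skewness(values):
    """Population skewness: 0 for a symmetric distribution, positive for a right tail."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)


def draw_exponential(rng):
    """One draw from an exponential population (mean 1, strongly right-skewed)."""
    return rng.expovariate(1.0)


# Assumed settings for illustration: small samples (n=2) versus larger ones (n=50).
means_small = sample_means(draw_exponential, sample_size=2, num_samples=5000)
means_large = sample_means(draw_exponential, sample_size=50, num_samples=5000)

print("n=2 : mean", round(statistics.fmean(means_small), 3),
      "skewness", round(skewness(means_small), 3))
print("n=50: mean", round(statistics.fmean(means_large), 3),
      "skewness", round(skewness(means_large), 3))
```

If the theorem holds for this setup, the sample means should centre on the population mean of 1 in both cases, while their skewness should shrink towards zero (the value for a normal distribution) as the sample size grows.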
Let’s start with the basics
To understand this theorem more clearly, I’ll briefly discuss histograms and the standard normal distribution.
Histograms
Histograms are a simple chart type used by nearly every data scientist, mostly to understand and visualise the distribution of a given dataset.
A histogram groups the values of a variable (say, the weight of individuals) into bins along the x-axis and shows the number of occurrences in each bin on the y-axis, as shown in the given figure.
This depiction makes it easy to visualise the underlying distribution of the dataset and to understand other properties such as skewness and kurtosis. When building a histogram, it is important to choose the number of bins carefully and to use equal-width bins for ease of interpretation.
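As a rough illustration of equal-width binning, here is a small sketch in plain Python. The weights are made-up values, and `histogram_counts` is a hypothetical helper written for this example rather than a library function.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            # The maximum value would land one past the end, so clamp it
            # into the last bin.
            idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return lo, width, counts


# Hypothetical weights in kilograms, for illustration only.
weights = [52, 55, 58, 60, 61, 63, 64, 65, 67, 70, 72, 75, 78, 85, 92]

lo, width, counts = histogram_counts(weights, num_bins=4)
for i, count in enumerate(counts):
    left = lo + i * width
    # One text "bar" per bin: the x-axis range followed by the count.
    print(f"{left:5.1f} to {left + width:5.1f} | {'#' * count} ({count})")
```

Changing `num_bins` shows why the bin count matters: too few bins hide the shape of the distribution, while too many leave most bins nearly empty.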
Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution (the bell curve). It is the distribution of a normal random variable that has a mean of zero and a standard deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score. Every normal random variable X can be transformed into a z score via the following equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean, and σ is the standard deviation.
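The transformation can be sketched in a few lines of Python. The weights, the value 85 and the parameters μ = 70 and σ = 10 are assumptions chosen only for illustration, and `z_score` is a hypothetical helper name. The last step uses `statistics.NormalDist` from the standard library to look up the probability of falling below a given z score.

```python
import statistics


def z_score(x, mu, sigma):
    """Standardise x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma


# Hypothetical weights in kilograms, for illustration only.
weights = [52, 55, 58, 60, 61, 63, 64, 65, 67, 70]
mu = statistics.fmean(weights)
sigma = statistics.pstdev(weights)  # population standard deviation

# After standardising, the values have mean 0 and standard deviation 1.
standardized = [z_score(w, mu, sigma) for w in weights]

# Assumed example: X is normal with mean 70 and standard deviation 10.
z = z_score(85, 70, 10)
prob_below = statistics.NormalDist().cdf(z)  # P(Z <= z) under the standard normal

print("z score:", z)
print("probability of a value below 85:", round(prob_below, 4))
```

Note that standardising a dataset only shifts and rescales it. The result follows the standard normal distribution only if the original variable was normally distributed to begin with.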
Assumptions Behind the Central Limit Theorem
It’s important to understand the assumptions behind this theorem:
The data must satisfy the randomization condition: it must be sampled randomly.
Samples should be independent of each other. One sample should not influence the other samples.
When sampling is done without replacement, the sample size should be no more than 10% of the population.
The sample size should be sufficiently large. When the population is skewed or asymmetric, the sample size should be large. If the population is symmetric, then we can draw small samples as well.
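The sample-size condition can be checked with a quick experiment. This is a minimal sketch under my own assumptions: a uniform population stands in for "symmetric", an exponential population stands in for "skewed", and `skewness_of_sample_means` is a hypothetical helper. Skewness near zero is used here as a rough indicator that the sample means look normal.

```python
import random
import statistics


def skewness_of_sample_means(draw, sample_size, num_samples=4000, seed=1):
    """Skewness of the means of many samples drawn with `draw(rng)`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]
    mu = statistics.fmean(means)
    sigma = statistics.pstdev(means)
    return statistics.fmean(((m - mu) / sigma) ** 3 for m in means)


def draw_uniform(rng):
    """One draw from a symmetric population (uniform on [0, 1])."""
    return rng.uniform(0.0, 1.0)


def draw_exponential(rng):
    """One draw from a right-skewed population (exponential with mean 1)."""
    return rng.expovariate(1.0)


# Same small sample size for both populations.
symmetric_small = skewness_of_sample_means(draw_uniform, sample_size=5)
skewed_small = skewness_of_sample_means(draw_exponential, sample_size=5)
# The skewed population needs a much larger sample to look normal.
skewed_large = skewness_of_sample_means(draw_exponential, sample_size=100)

print("symmetric population, n=5  :", round(symmetric_small, 3))
print("skewed population,    n=5  :", round(skewed_small, 3))
print("skewed population,    n=100:", round(skewed_large, 3))
```

With samples of only five values, the means from the symmetric population should already be close to symmetric, while the means from the skewed population should still show a clear right tail that fades only at the larger sample size.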
The central limit theorem has important implications in applied machine learning. It informs the solution of linear algorithms such as linear regression, but not of complex models such as artificial neural networks, which are fit using numerical optimization methods. For those models, we must instead run experiments to observe and record the behaviour of the algorithms and use statistical methods to interpret the results.
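One common way to interpret such experiments is to repeat a run several times and report a confidence interval for the mean score, which relies on the sample mean being approximately normal. The sketch below uses made-up accuracy scores and a hypothetical helper, `mean_confidence_interval`. It also assumes the normal approximation is acceptable, although with only a handful of runs a t-based interval would usually be the more cautious choice.

```python
import statistics

# Hypothetical accuracy scores from 10 repeated runs of the same model.
scores = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.82]


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / n ** 0.5  # standard error of the mean
    # z value that leaves (1 - confidence) / 2 in each tail of the standard normal.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err


mean, low, high = mean_confidence_interval(scores)
print(f"mean accuracy {mean:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Comparing the intervals of two models gives a first rough check of whether an observed difference in accuracy is larger than the run-to-run noise.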