This is the toughest part of data science, and many aspirants get overwhelmed by the vast range of topics and concepts in statistics.

I can broadly divide all statistical topics into two components.

🔹 Basic Statistics – This includes all the statistical analysis that we do before modeling, such as variability, hypothesis testing, and probability.

🔹 Statistics in Machine Learning Algorithms – All the ML algorithms we use for modeling are based on mathematics and statistics. You should be aware of the statistical concepts behind the specific ML algorithm you are using.

1. In basic statistics, when we first get the data in a project, we do some quick analysis: we check the distribution of each variable, the central tendency, the max and min, the variability, and any outliers. We also try to frame the questions we want the data to answer.
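As a rough illustration of that first quick look, here is a minimal sketch that uses only Python's standard library. The `values` list and the `quick_summary` helper are made up for this example, and the 1.5 × IQR rule is just one common way to flag outliers, not the only one.

```python
import statistics

# Hypothetical values of one numeric variable; 95 is a planted outlier.
values = [12, 15, 14, 10, 13, 16, 11, 14, 95, 13]

def quick_summary(data):
    """Return central tendency, spread, range and IQR-based outliers for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (default "exclusive" method)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
        "min": min(data),
        "max": max(data),
        "outliers": [x for x in data if x < lower or x > upper],
    }

summary = quick_summary(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up data the mean (21.3) sits well above the median (13.5), which is the kind of gap that hints at an outlier or a skewed distribution.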

The main tasks that require basic statistics, before modeling or any analysis, are:

1. Exploratory data analysis

2. Chi-Square testing

3. Proportion test

4. Feature engineering – We create the independent and dependent variables

5. Data Cleaning – Missing Value and Outlier Treatment

6. Sampling the data
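Two of the tests in the list above, the chi-square test and the proportion test, can be sketched with the standard library alone. The counts below are hypothetical, and the function names are only for illustration. The chi-square version here is the plain Pearson statistic for a 2x2 table without continuity correction, and its p-value formula is valid only for 1 degree of freedom.

```python
import math
from statistics import NormalDist

# Hypothetical counts: rows are two customer groups, columns are (converted, not converted).
table = [[30, 70],
         [45, 55]]

def chi_square_2x2(t):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in t]
    col_totals = [sum(col) for col in zip(*t)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (t[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions, using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

chi2, chi2_p = chi_square_2x2(table)
z, z_p = two_proportion_z_test(30, 100, 45, 100)
print(f"chi-square = {chi2:.3f}, p = {chi2_p:.4f}")
print(f"z = {z:.3f}, p = {z_p:.4f}")
```

For a 2x2 table the two tests agree: the chi-square statistic equals the square of the z statistic, so both give the same p-value. In real projects we would normally rely on a statistics library rather than hand-written formulas.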

Here is a mind map I created that covers the basic statistics we use day to day in our projects.


There are four main sections in basic statistics and maths:

1. Probability

2. Distributions

3. Estimations

4. Inferences

We try to understand the business problem we are solving with the data. We create the independent and dependent variables, decide how to sample the data, and run some statistical tests to see whether the samples we created are statistically significant. Then we finalize the data for the model.

All of these activities can generally be classified as basic statistics, since they happen before we run the machine learning algorithm.

If you want to self-study, here is the link to Inferential Statistics courses from Massive Open Online Courses (MOOCs).


2. In Statistics for Machine Learning Algorithms, we should focus on one machine learning algorithm for regression problems and one for classification, and know them very well. For the others, you can just learn the important points and build on them as you do more projects.

For any machine learning algorithm we implement, the minimum requirement is to know the assumptions of that algorithm and to be able to test those assumptions statistically.

For example, for multiple linear regression, we should test normality of the residuals, autocorrelation of the residuals, multicollinearity among the predictors, and homoscedasticity.
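Two of these checks are easy to sketch by hand: the Durbin-Watson statistic for autocorrelation and the variance inflation factor (VIF) for multicollinearity. The residuals and predictors below are hypothetical, and the VIF shortcut `1 / (1 - r^2)` only holds when there are exactly two predictors. Normality and homoscedasticity need other tests and are not shown here.

```python
import math

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF for the special case of exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical residuals from a fitted model, in observation order.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3, -0.2]
# Hypothetical predictors; x2 is deliberately close to 2 * x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

dw = durbin_watson(residuals)
vif = vif_two_predictors(x1, x2)
print(f"Durbin-Watson = {dw:.2f}")
print(f"VIF = {vif:.1f}")
```

The Durbin-Watson statistic always falls between 0 and 4. Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, which the near-duplicate predictors here trigger easily.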

Similarly, for logistic regression, we should know the confusion matrix, recall, precision, and AUC.
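To make these metrics concrete, here is a small sketch in plain Python. The labels and predicted probabilities are invented, the 0.5 threshold is an assumption, and the helper names are only for illustration. AUC is computed from its pairwise definition: the chance that a randomly chosen positive gets a higher score than a randomly chosen negative.

```python
# Hypothetical true labels and predicted probabilities from a logistic regression.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

def confusion_counts(labels, probs, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def auc(labels, probs):
    """AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
auc_score = auc(y_true, y_prob)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} AUC={auc_score:.3f}")
```

Precision and recall depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why it helps to look at them together.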

This can get more advanced depending on the complexity of the algorithm, but you can learn it as you do the project.
