Data Science for Beginners: Descriptive Stats
From explanations on the same over at runrex.com, statistics is the branch of mathematics dealing with the collection, organization, and interpretation of data. As a data scientist, when you first get data to work with, rather than applying fancy algorithms and trying to make some predictions from it, you will first have to read and understand the data, which will be made possible by applying statistical techniques. This will allow you to understand what type of distribution the data has, as covered in detail over at guttulus.com. This article will look to give an overview of what descriptive statistics is as well as the different types.
Let us first start by defining what descriptive statistics is. According to discussions on the same over at runrex.com, it involves summarizing and organizing data so that it can be understood more easily. It seeks to describe the data but, unlike inferential statistics, it doesn’t attempt to make inferences from the sample and extrapolate them to the whole population. This means that, unlike inferential statistics, descriptive statistics is not built on probability theory.
Descriptive statistics, as pointed out by the subject matter experts over at guttulus.com, can be broken down into two categories: measures of central tendency and measures of variability.
- Measures of Central Tendency
Measures of central tendency estimate the center, or average, of a data set. The idea is that there is one number that best summarizes the entire set of measurements, one that is in some way “central” to the data set. Measures of central tendency include:
- The Mean/Average
This is a central tendency of the data, which means that it is a number around which the whole data is spread. As discussed over at runrex.com, to calculate the mean, M, you simply add the response values and then divide the sum by the total number of responses or observations, called N.
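The calculation above can be sketched in a few lines of Python; the response values here are made up for illustration.

```python
# Minimal sketch of the mean: add the response values, divide by the count.
responses = [4, 8, 6, 5, 2]  # hypothetical response values

N = len(responses)           # total number of responses/observations
M = sum(responses) / N       # the mean
print(M)                     # 5.0
```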
- The Median
The median is the value that divides the data into 2 equal parts, which means that the number of terms on its right side is the same as on its left when the data is arranged in either ascending or descending order. If the number of terms is odd, the median will be the middle term, but if the number of terms is even, then the median will be the average of the 2 middle terms.
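Both the odd and even cases can be sketched directly; the sample values below are invented.

```python
# Minimal sketch of the median for odd and even numbers of terms.
def median(values):
    s = sorted(values)                    # arrange in ascending order
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                     # odd count: the middle term
    return (s[mid - 1] + s[mid]) / 2      # even count: average of the 2 middle terms

print(median([7, 1, 3]))       # 3
print(median([7, 1, 3, 9]))    # (3 + 7) / 2 = 5.0
```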
- The Mode
The mode is the term with the highest frequency, which means that it appears the most times. Some data sets have no mode because all terms appear the same number of times. As discussed over at guttulus.com, if two values share the highest frequency, the data is bimodal; with 3 such values it is trimodal, and with n such values it is multimodal.
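A short sketch of finding the mode(s), returning every value tied for the highest frequency so that bimodal data is handled; the example values are made up.

```python
from collections import Counter

# Return all values with the highest frequency (one mode, or several if tied).
def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([1, 2, 2, 3]))      # [2] -> one mode
print(modes([1, 1, 2, 2, 3]))   # [1, 2] -> bimodal
```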
- Measures of Variability / Spread
Measures of variability give you a sense of how spread out the response values in your data are. They include:
- Standard Deviation
This is the measurement of the average distance between each quantity and the mean. It gives the average amount of variability in your dataset and tells you, on average, how far each value lies from the mean. This, as per the gurus over at runrex.com, means that the larger the standard deviation, the more variable the dataset is. If you want to calculate the standard deviation, follow these steps:
- List each score and then calculate the mean
- Subtract the mean from each score to get the deviation from the mean
- Square each of these deviations
- Add up all of these squared deviations
- Divide the sum of the squared deviations by N-1
- Find the square root of the number you found from the above calculation
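The six steps above can be sketched one line per step; the scores are invented for illustration.

```python
import math

# Each line mirrors one step of the standard deviation calculation.
scores = [2, 4, 4, 4, 5, 5, 7, 9]                # step 1: list each score

mean = sum(scores) / len(scores)                 # step 1: calculate the mean
deviations = [x - mean for x in scores]          # step 2: deviation from the mean
squared = [d ** 2 for d in deviations]           # step 3: square each deviation
total = sum(squared)                             # step 4: add up the squared deviations
variance = total / (len(scores) - 1)             # step 5: divide by N - 1
std_dev = math.sqrt(variance)                    # step 6: take the square root
print(round(std_dev, 4))
```

Note that dividing by N - 1 (rather than N) gives the sample standard deviation, which matches the step list above.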
- Variance
This is the average of squared deviations from the mean, and it reflects the degree of spread in the data set. The more spread out the data, the larger the variance is in relation to the mean. To find the variance, all you have to do is square the standard deviation.
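This relationship can be checked with Python's standard library; the data is made up.

```python
import statistics

# The variance is simply the square of the (sample) standard deviation.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sd = statistics.stdev(data)     # sample standard deviation (N - 1 denominator)
variance = sd ** 2              # squaring it gives the variance
print(round(variance, 4))
```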
- Range
This gives you an idea of how far apart the most extreme values of a dataset are. It is one of the simplest techniques in descriptive statistics: to find the range, all you have to do is subtract the lowest value from the highest value.
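As a one-line sketch with invented values:

```python
# Range: highest value minus lowest value.
data = [3, 11, 4, 7, 2]
data_range = max(data) - min(data)
print(data_range)  # 11 - 2 = 9
```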
- Percentile
The percentile is a way of representing the position of a value in a data set, and as outlined in discussions on the same over at guttulus.com, to calculate a percentile, the values in the dataset should first be arranged in ascending order.
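There are several conventions for computing percentiles; the sketch below uses the nearest-rank method on made-up data, so treat it as one possible approach rather than the only definition.

```python
import math

# Nearest-rank percentile: sort ascending, then pick the value at the p-th rank.
def percentile(values, p):
    s = sorted(values)                      # values must be in ascending order
    rank = math.ceil(p / 100 * len(s))      # position of the p-th percentile
    return s[max(rank - 1, 0)]

print(percentile([15, 20, 35, 40, 50], 40))   # 20
```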
- Quartiles
Quartiles are values that divide the data into quarters, once the data is sorted in ascending order. There are 3 quartile values: Q1, at the 25th percentile, is the median of the lower half of the data; Q2, at the 50th percentile, is the median of the whole dataset; and Q3, at the 75th percentile, is the median of the upper half of the data. To get the Interquartile Range, IQR, you subtract Q1 from Q3.
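Quartiles and the IQR can be sketched with the standard library's `statistics.quantiles`; the data below is invented, and the "inclusive" method is one of several quartile conventions.

```python
import statistics

# n=4 splits sorted data into quarters, returning the 3 cut points Q1, Q2, Q3.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # Interquartile Range: Q3 minus Q1
print(q1, q2, q3, iqr)
```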
The above discussion only begins to scratch the surface of what descriptive statistics is, and you can uncover more insights by visiting the excellent runrex.com and guttulus.com.