Data Science for Beginners: Inferential Stats

tony

4 years ago

As is discussed in detail over at runrex.com, descriptive statistics summarize the data you have in hand, using measures such as the mean and standard deviation. A group of data that includes every item you are interested in is called a population, and it can be small or large. For example, if you are interested only in the heights of 100 of your colleagues, those 100 colleagues are your population. Descriptive statistics, as explained over at guttulus.com, can be applied to populations, and properties of a population such as its mean and standard deviation are known as parameters because they describe the whole population. In most instances, however, as a data scientist you won’t have access to the whole population you want to investigate, only a limited amount of data. For example, you may be interested in the exam marks of all students in the US for a particular test. Since it is not feasible to collect the marks of every student in the country, you will have to measure a smaller sample of students and use it to represent the larger population. Properties of a sample, such as its mean or standard deviation, are called statistics rather than parameters. Inferential statistics are the techniques that allow us to use such samples to make generalizations about the populations from which they were drawn, which is what this article looks at in more detail.
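To make the difference between a parameter and a statistic concrete, here is a minimal Python sketch. The population is simulated (the mean of 65, the spread of 12, and the sizes are made-up assumptions, not real exam data), so that we can compute the true parameters and compare them with the statistics from one random sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated exam marks for 100,000 students,
# clipped to the 0-100 range. These numbers are illustrative only.
population = [min(100.0, max(0.0, rng.gauss(65, 12))) for _ in range(100_000)]

# Parameters describe the whole population.
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)  # population formula (divide by N)

# Statistics describe a sample drawn from that population.
sample = rng.sample(population, 200)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample formula (divide by n - 1)

print(f"parameters: mean={population_mean:.2f}, sd={population_sd:.2f}")
print(f"statistics: mean={sample_mean:.2f}, sd={sample_sd:.2f}")
```

In practice you would only ever see the second line of output, since the whole point is that the population cannot be measured in full.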

As mentioned above, and covered in detail over at runrex.com, inferential statistics is about using data from a sample to make inferences about the larger population from which the sample was drawn. The goal is to draw conclusions from a sample and then generalize them to the population. Unlike descriptive statistics, inferential statistics relies on probability theory, which is used to work out how likely the observed characteristics of a sample would be under given assumptions about the population. Given that inferential statistics uses samples to make inferences about the population, it is important that the samples represent the population as accurately as possible, which depends on how the sampling is carried out. As is revealed in discussions on the same over at guttulus.com, the need for inferential statistics arises from the fact that sampling naturally incurs sampling error, which means that a sample cannot be expected to represent the population perfectly. There are two main areas of inferential statistics:

Estimating parameters

This simply means taking a statistic from your sample data, such as the sample mean, and using it to say something about a population parameter, in this instance the population mean.
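One common way to estimate a parameter is to report the sample mean together with a confidence interval. The sketch below is an illustration under stated assumptions: the helper name `mean_confidence_interval` is hypothetical, the heights are simulated, and the interval uses a normal approximation, which is reasonable for fairly large samples (for small samples a t-distribution would usually be preferred).

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (sample mean, lower bound, upper bound) for the population mean.

    Uses a normal approximation, so it is a sketch for large samples only.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * standard_error, mean + z * standard_error


rng = random.Random(7)
# Hypothetical sample: simulated heights (cm) of 150 people.
heights = [rng.gauss(170, 8) for _ in range(150)]

estimate, low, high = mean_confidence_interval(heights)
print(f"estimated population mean: {estimate:.1f} cm "
      f"(95% CI {low:.1f} to {high:.1f})")
```

The interval expresses the uncertainty that comes from measuring only a sample: a larger sample gives a smaller standard error and therefore a narrower interval.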

Hypothesis tests

Statistical hypothesis testing involves using sample data to answer research questions, as explained in detail over at runrex.com. For example, you may be interested in finding out if a new drug is effective, or if a lack of sleep affects performance at work. In hypothesis testing, the question is usually whether the data give enough evidence to reject the null hypothesis, which is a statement that there is no statistical difference between the status quo and the experimental condition.

As pointed out by the gurus over at guttulus.com, inferential tests seek to determine whether the sample characteristics deviate enough from what the null hypothesis predicts to justify rejecting it. The procedure for performing an inferential test includes the following steps:

Start with a theory
Come up with a research hypothesis
Operationalize the variables
Identify the population to which the study results should apply
Form a null hypothesis for this population
Collect a sample from the population and run the study
Perform statistical tests to determine if the obtained sample characteristics are sufficiently different from what would be expected under the null hypothesis to justify rejecting the null hypothesis.
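The last two steps above can be sketched in Python. This example uses a permutation test, which is only one of many possible inferential tests and is chosen here because it needs nothing beyond the standard library. The function name `permutation_test`, the sleep scenario, and all of the scores are hypothetical and simulated for illustration.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same population, so the
    group labels are interchangeable. Returns (observed difference, p-value).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_b) - statistics.fmean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the labels at random
        diff = statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


rng = random.Random(3)
# Hypothetical work-performance scores (simulated, not real measurements).
rested = [rng.gauss(75, 8) for _ in range(40)]
sleep_deprived = [rng.gauss(65, 8) for _ in range(40)]

difference, p_value = permutation_test(rested, sleep_deprived)
alpha = 0.05  # significance level chosen before looking at the data
decision = "reject" if p_value < alpha else "fail to reject"
print(f"difference in means: {difference:.2f}, p-value: {p_value:.4f}")
print(f"decision at alpha={alpha}: {decision} the null hypothesis")
```

The p-value is the share of random relabelings that produce a difference at least as large as the one observed. A small value means the observed sample would be unusual if the null hypothesis were true.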

There are several reasons to use inferential statistics as highlighted in discussions on the same over at runrex.com, and they include:

It allows analysts to generalize findings to the larger population.
It can be used to determine not just what can happen, but also what tends to happen in programs.
It helps assess the strength of the relationship between independent variables, also known as causal variables, and dependent variables, which are also known as effect variables.
It can also be used to determine the strength of relationships within a given sample.
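As an illustration of measuring the strength of a relationship within a sample, the sketch below computes the Pearson correlation coefficient by hand. Pearson's r is just one possible measure and is used here as an assumption about what "strength" means. The function name `pearson_r` and the sleep and performance figures are hypothetical.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariation / (spread_x * spread_y)


# Hypothetical sample: hours of sleep (independent variable) and a
# performance score (dependent variable) for eight people.
hours_of_sleep = [4, 5, 5, 6, 7, 7, 8, 9]
performance = [52, 58, 55, 63, 70, 68, 74, 80]

r = pearson_r(hours_of_sleep, performance)
print(f"sample correlation r = {r:.2f}")
```

A value near 1 or -1 indicates a strong linear relationship in the sample and a value near 0 a weak one. Note that r itself is a sample statistic, so generalizing it to the population still calls for an inferential test, and a strong correlation on its own does not show that one variable causes the other.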

As mentioned earlier, sampling naturally incurs some error. As revealed in discussions on the same over at guttulus.com, two sources of error may cause samples to differ from the populations from which they are drawn: sampling error and sampling bias. This is why inferential statistics has two main limitations:

The first limitation is that you are drawing conclusions about a population that you have not fully measured, which means that you can never be completely sure that the values you calculate match the true population parameters.
The second limitation is connected to the first: some, but not all, inferential tests require you to make educated guesses, based on theory, in order to run them. These assumptions introduce further uncertainty, which adds an element of doubt to the results of those tests.
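The two sources of error mentioned above, sampling error and sampling bias, can be seen in a small simulation. Everything here is an assumption made for illustration: the population is simulated, and the "biased" sampling scheme (drawing only from the above-average half) is a deliberately extreme, hypothetical example.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of 50,000 simulated values.
population = [rng.gauss(50, 10) for _ in range(50_000)]
true_mean = statistics.fmean(population)

# Sampling error: means of random samples scatter around the true mean.
random_means = [
    statistics.fmean(rng.sample(population, 100)) for _ in range(500)
]

# Sampling bias: samples drawn only from the above-average half of the
# population are systematically too high, however many are taken.
biased_pool = [x for x in population if x > true_mean]
biased_means = [
    statistics.fmean(rng.sample(biased_pool, 100)) for _ in range(500)
]

print(f"true population mean:        {true_mean:.2f}")
print(f"average of random samples:   {statistics.fmean(random_means):.2f} "
      f"(spread {statistics.stdev(random_means):.2f})")
print(f"average of biased samples:   {statistics.fmean(biased_means):.2f}")
```

Random sampling error averages out over repeated samples and shrinks as the sample size grows, whereas bias does not, which is why the way a sample is collected matters as much as its size.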

The above discussion is just a speck in the ocean of the information available on this topic, and you can uncover more insights by visiting the excellent runrex.com and guttulus.com.