# Top 20 Interview Questions for Data Science Jobs


Data science has been described as the “Sexiest Job of the 21st Century” by the Harvard Business Review. In the era of Big Data and Machine Learning, it is unsurprising that data scientists are in such high demand, as discussed over at runrex.com. If you are on the path to becoming a data scientist, you need to be prepared to impress prospective employers with your knowledge, from explaining why data science is so important to showing that you are technically proficient with Big Data concepts, frameworks, and applications. To help you with that, here is a list of the top 20 interview questions for data science jobs, along with their answers.

What is the difference between supervised and unsupervised machine learning?

In supervised learning, an algorithm learns from labeled training data, that is, examples that include the response variable, so that what it has learned can be applied to unseen test data. Classification is an example of supervised learning. In unsupervised learning, on the other hand, the data has no response variable or labels, so the algorithm has to find structure in the data on its own. Clustering is an example of unsupervised learning, with more on this topic to be found over at guttulus.com.

How is logistic regression done?

Logistic regression measures the relationship between a binary dependent variable and one or more independent variables. It passes a linear combination of the independent variables through the logistic (sigmoid) function to estimate the probability of the outcome. A more detailed write-up on this topic can be found over at runrex.com.
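As a rough illustration, the idea can be sketched in plain Python. The data below is made up, and the helper names `sigmoid` and `fit_logistic` are hypothetical; in practice you would more likely reach for an established library than hand-roll gradient descent.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by gradient descent on the log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to b0 and b1.
        g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
        g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Toy, made-up data: one predictor and a binary outcome.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 1.0)   # estimated probability at a small x
p_high = sigmoid(b0 + b1 * 4.0)  # estimated probability at a large x
```

The fitted slope `b1` is positive here, so the estimated probability of the positive class rises as the predictor increases.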

Python or R – Which one would you prefer for text analytics?

According to the subject matter experts over at guttulus.com, the best possible answer for this question would be Python, because its pandas library provides easy-to-use data structures and high-performance data analysis tools.

What is logistic regression? / State an example when you have used logistic regression

Logistic regression, which is often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 1 or 0 (Win/Lose). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, and so forth.

What are recommender systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. They are widely used to recommend movies, news, research articles, products, social tags, music, and so forth.

Why does data cleaning play a vital role in the analysis of data?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process, as discussed over at runrex.com. As the number of data sources increases, the time taken to clean the data grows rapidly, both because of the number of sources and because of the volume of data they generate. Cleaning alone might take up to 80% of the total time, which makes it a critical part of the analysis task.

Differentiate between univariate, bivariate, and multivariate analysis

According to guttulus.com, these are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between two variables at the same time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales against spending is a bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

What do you understand by the term ‘Normal Distribution’?

Data can be distributed in different ways: with a bias to the left, with a bias to the right, or all jumbled up, as per runrex.com. However, data can also be distributed around a central value without any bias to the left or right, in which case it follows a normal distribution in the form of a bell-shaped curve.

What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. Here, X is referred to as the predictor variable while Y is referred to as the criterion variable.
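For a single predictor, the ordinary least squares fit has a closed form, which can be sketched as follows. The function name `fit_line` and the data are hypothetical, chosen so that the points lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 6  # predicted Y for a new X of 6
```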

What are Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values, while extrapolation is approximating a value that lies outside the known range by extending a known set of values or facts.
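A minimal sketch of the distinction, assuming a straight line through two known points (the helper `linear_estimate` and the numbers are hypothetical):

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Known points: (2, 20) and (3, 30).
interpolated = linear_estimate(2.5, 2, 20, 3, 30)  # x lies between the known points
extrapolated = linear_estimate(5, 2, 20, 3, 30)    # x lies beyond the known points
```

The same formula serves both cases. The difference is only whether `x` falls inside or outside the known range, and extrapolated values are generally less trustworthy the further they sit from the data.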

What is collaborative filtering?

According to the subject matter experts over at guttulus.com, this is the filtering process used by most recommender systems to find patterns or information by combining the viewpoints of many users, data sources, and agents. In practice, it predicts what a user will like from the preferences of other users with similar tastes.
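One common flavor is user-based collaborative filtering, sketched below on made-up ratings. The user names, items, and the helpers `cosine` and `predict` are all hypothetical, and cosine similarity is just one of several similarity measures that could be used.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ana": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cal": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        weight = cosine(ratings[user], their)
        num += weight * their[item]
        den += weight
    return num / den if den else None

predicted = predict("ana", "D")  # ana has not rated item D
```

Because ana's ratings resemble ben's more than cal's, the prediction for item D is pulled toward ben's rating.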

What is the difference between cluster and systematic sampling?

Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection or cluster of elements. Systematic sampling, on the other hand, is a statistical technique where elements are selected from an ordered sampling frame at a fixed interval, as discussed over at runrex.com. In systematic sampling, the list is traversed circularly, so once you reach the end of the list, you continue from the top again.
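The two schemes can be contrasted in a short sketch. The function names and the toy frame are hypothetical, and the systematic variant shown is the circular one, which wraps around to the top of the list.

```python
import random

def circular_systematic_sample(frame, n, start=None):
    """Pick n elements at a fixed interval, wrapping around the ordered frame."""
    size = len(frame)
    step = max(1, size // n)
    if start is None:
        start = random.randrange(size)  # random starting position
    return [frame[(start + j * step) % size] for j in range(n)]

def cluster_sample(clusters, n_clusters, rng=random):
    """Randomly pick whole clusters and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [element for name in chosen for element in clusters[name]]

frame = list(range(10))
systematic = circular_systematic_sample(frame, 5, start=7)

# Made-up clusters, e.g. households grouped by region.
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8, 9]}
clustered = cluster_sample(clusters, 2, rng=random.Random(1))
```

Systematic sampling selects individual elements spread across the whole frame, whereas cluster sampling selects groups and takes everything in them.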

Are expected value and mean value different?

The two are not different in value, but the terms are used in different contexts, as per guttulus.com. Mean is generally used when talking about a probability distribution or a sample, while expected value is generally used in the context of a random variable. The sample mean is computed from the sampling data, so it varies from sample to sample. The expected value is the mean of all those sample means, i.e. the value you would get by averaging over many repeated samples, and it equals the population mean.
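A fair six-sided die makes the distinction concrete. This is only an illustrative sketch: the expected value comes from the probability model, while the sample mean comes from simulated rolls and changes with the random seed.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Expected value: each face weighted by its probability of 1/6.
expected_value = sum(face * Fraction(1, 6) for face in faces)

# Sample mean: the average of observed data, here 10,000 simulated rolls.
rng = random.Random(0)
rolls = [rng.choice(faces) for _ in range(10_000)]
sample_mean = sum(rolls) / len(rolls)
```

The expected value is exactly 7/2, and the sample mean lands close to it without generally being equal to it.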

What does P-value signify about the statistical data?

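P-value is used to determine the significance of results after a hypothesis test in statistics. It is the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true. It helps readers draw conclusions and is always between 0 and 1.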

At the conventional 0.05 significance level:

P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis can’t be rejected.

P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

P-value very close to 0.05 is marginal, indicating it is possible to go either way.
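As a small worked sketch, consider testing whether a coin is fair after observing 60 heads in 100 flips. The function name `fair_coin_p_value` is hypothetical, and the two-sided p-value is obtained by doubling one tail, which is valid here because the distribution under a fair coin is symmetric.

```python
from math import comb

def fair_coin_p_value(heads, flips):
    """Two-sided exact binomial p-value for the null hypothesis of a fair coin."""
    extreme = max(heads, flips - heads)
    # Probability of a result at least this extreme in one tail.
    tail = sum(comb(flips, k) for k in range(extreme, flips + 1)) / 2 ** flips
    return min(1.0, 2 * tail)

p_value = fair_coin_p_value(60, 100)
reject_null = p_value < 0.05
```

With these numbers, the p-value falls in the marginal zone near 0.05, which is exactly the kind of result where the conclusion depends heavily on the chosen threshold.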

Do gradient descent methods always converge to the same point?

No, they don’t. In some cases, gradient descent reaches a local minimum (a local optimum) and never reaches the global optimum. Where it ends up depends on the data and the starting conditions.
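This is easy to see on a one-dimensional function with two valleys. The function below is an arbitrary example chosen for illustration, not one taken from any particular model.

```python
def f(x):
    """A non-convex function with two local minima."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, lr=0.01, steps=1000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

left = gradient_descent(-2.0)   # settles in the left valley
right = gradient_descent(2.0)   # settles in the right valley
```

Both runs use the same function and learning rate, yet they stop at different points, and only the left one is the global minimum since `f(left)` is lower than `f(right)`.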

What is the goal of A/B testing?

As is explained over at runrex.com, this is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify which changes to a web page maximize or increase the outcome of interest. An example of this could be comparing the click-through rate of two versions of a banner ad.
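One common way to analyze such an experiment is a two-proportion z-test, sketched below with made-up click counts. The function name is hypothetical, and the normal approximation it relies on assumes reasonably large samples.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return the z statistic and two-sided p-value for a difference in rates."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up results: variant A got 100 clicks from 1000 views, variant B got 150.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value here would suggest the difference in click-through rate between the two banners is unlikely to be due to chance alone.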

What is an Eigenvalue and Eigenvector?

As pointed out by the gurus over at guttulus.com, eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. The eigenvalue, on the other hand, is the strength of the transformation in the direction of its eigenvector, that is, the factor by which the stretching or compression occurs.
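A minimal sketch of how the dominant eigenvalue and eigenvector can be estimated is power iteration, shown here on a made-up 2x2 symmetric matrix. The function name `power_iteration` is hypothetical, and numerical libraries offer far more robust routines.

```python
import math

def power_iteration(matrix, steps=100):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)  # arbitrary starting direction
    for _ in range(steps):
        # Apply the transformation, then rescale to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # For a unit vector v, the Rayleigh quotient v^T A v estimates the eigenvalue.
    av = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * av[i] for i in range(n))
    return eigenvalue, v

# Made-up covariance-like matrix with eigenvalues 3 and 1.
cov = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(cov)
```

Repeatedly applying the matrix pulls any starting vector toward the direction it stretches the most, here the diagonal direction where both components are equal, with a stretch factor of 3.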

During analysis, how do you treat missing values?

After identifying the variables with missing values, the extent of the missing values in each one is assessed. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with the mean or median (imputation), or the affected records can simply be ignored. More on this can be found over at runrex.com.
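Mean or median imputation can be sketched in a few lines. The `impute` helper and the sample values are hypothetical, with `None` standing in for a missing entry.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace missing entries (None) with the mean or median of observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 28]  # made-up column with two missing values

missing_share = ages.count(None) / len(ages)  # extent of missingness
mean_filled = impute(ages, "mean")
median_filled = impute(ages, "median")
```

The median is usually the safer choice when the variable is skewed or has outliers, since the mean gets pulled toward extreme values.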

Explain the Box-Cox transformation in regression models

For one reason or another, the response variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique that transforms a non-normal dependent variable into an approximately normal shape. Many statistical techniques assume normality, so when the given data is not normal, applying a Box-Cox transformation means that you can run a broader range of tests.
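The transformation itself is a simple power formula controlled by a parameter, usually written as lambda. The sketch below applies it for a lambda supplied by hand; the helper name `box_cox` is hypothetical, and in practice lambda is typically estimated from the data rather than picked manually. Note that the transformation is only defined for strictly positive values.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a single positive value for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)          # the limiting case as lambda approaches 0
    return (y ** lam - 1) / lam

skewed = [1, 2, 4, 8, 16, 32, 64]   # made-up right-skewed response values
log_like = [box_cox(y, 0) for y in skewed]       # lambda = 0: natural log
sqrt_like = [box_cox(y, 0.5) for y in skewed]    # lambda = 0.5: square-root-like
```

Smaller lambda values compress large observations more strongly, which is what pulls in the long right tail of a skewed response.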

Which data scientists do you admire most?

This question doesn’t have a correct answer; it is designed to gauge how much you know about the field. You can mention influential data scientists such as Geoff Hinton, Demis Hassabis, Jake Porway, DJ Patil, Hilary Mason, and many others.

As always, if you are looking for more on this wide topic, then look no further than the excellent runrex.com and guttulus.com.