Data_Science_interview_questions


For More Important questions - Data Science Questions

Interview Questions - Data Scientist Positions (Entry, Mid & Senior)

This repository contains a curated list of topic-wise questions for Data Scientist positions at various companies.

⭐ - Entry Level positions
⭐⭐ - Mid Level positions
⭐⭐⭐ - Senior Level positions

Topics (Ongoing) No Of Questions
1. Data Science & ML - General Topics 34
2. Regression Techniques 22
3. Classification Techniques 39
3.1 Support Vector Machines 12
3.2 Decision Tree 16
3.3 Boosting (GBM, LightGBM, CatBoost) 5
3.4 Naive Bayes Classifier 5*
4. Stats & Probability 5
5. Deep Learning Fundamentals 15*
5.1 CNN 14
6. Attention & Transformers 10*

Data Science & ML - General Topics

  1. What is the basic difference between AI, Machine Learning (ML) & Deep Learning (DL)? ⭐

Ans: Artificial Intelligence (AI) is a broad field that encompasses many different techniques and technologies, including machine learning (ML) and deep learning (DL).

(figure: AI vs ML vs DL relationship diagram)


  1. Can you explain the difference between supervised and unsupervised learning? ⭐

Ans: The main difference is the data the algorithms learn from: supervised learning trains on labeled examples (input-output pairs) to predict a known target, while unsupervised learning finds structure, such as clusters or lower-dimensional representations, in unlabeled data.


  1. How do you handle missing data in your dataset? What are some common techniques for imputing missing values?. ⭐

Ans: There are several techniques for handling missing data in a dataset, some of the most common include:


  1. How do you select the appropriate evaluation metric for a given problem, and what are the trade-offs between different metrics such as precision, recall, and F1-score?. ⭐

Ans: Selecting the appropriate evaluation metric for a given problem depends on the characteristics of the data and the goals of the model. Here are some common evaluation metrics and the situations in which they are typically used:


  1. What does the beta value imply in the F-beta score? What is the optimal beta value? ⭐⭐

    Ans: The F-beta score is a variant of the F1-score in which the beta value controls the trade-off between precision and recall. The F1-score is the harmonic mean of precision and recall, calculated as (2 * (precision * recall)) / (precision + recall).

    - A beta value of 1 is equivalent to the F1-score, giving equal weight to precision and recall.
    - A beta value less than 1 gives more weight to precision.
    - A beta value greater than 1 gives more weight to recall.

    There is no universally optimal beta: choose beta < 1 when false positives are costlier, and beta > 1 when false negatives are costlier.
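As a sketch, the formula can be computed directly (the precision/recall values below are made up for illustration):

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical values where precision (0.8) is higher than recall (0.5):
p, r = 0.8, 0.5
f1  = fbeta(p, r, beta=1.0)   # harmonic mean of P and R
f05 = fbeta(p, r, beta=0.5)   # weights precision more -> higher here
f2  = fbeta(p, r, beta=2.0)   # weights recall more -> lower here
```

Since precision exceeds recall in this example, F0.5 > F1 > F2, matching the weighting described above.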


  1. What are the advantages & disadvantages of Linear Regression?.⭐

Ans:


  1. How do you handle categorical variables in a dataset? ⭐

Ans: Handling categorical variables in a dataset is an important step in the preprocessing of data before applying machine learning models. Here are some common techniques for handling categorical variables:


📙 Back to Top Section

  1. What is the curse of dimensionality and how does it affect machine learning? ⭐

Ans: The curse of dimensionality refers to the increasing complexity and computational cost of working in high-dimensional spaces. In machine learning, it arises when the number of features in a dataset is large relative to the number of observations. This can cause problems for several reasons:


  1. What are the approaches to mitigate the curse of dimensionality? ⭐

Ans: These are some mechanisms to deal with high dimensionality:


  1. Can you explain the bias-variance tradeoff? ⭐

Ans: The bias-variance tradeoff is a fundamental concept in machine learning describing the trade-off between error from overly simple assumptions (bias), which causes underfitting, and error from sensitivity to fluctuations in the training data (variance), which causes overfitting; reducing one typically increases the other.


  1. How do you prevent overfitting in a model?.⭐

Ans: Overfitting occurs when a model is too complex and captures the noise in the training data, instead of the underlying patterns. This can lead to poor performance on new, unseen data. Here are some common techniques for preventing overfitting:


  1. What is Hypothesis Testing? Explain with a proper example. ⭐

Ans: Hypothesis testing is a statistical method used to determine whether a claim or hypothesis about a population parameter is true or false. The process of hypothesis testing involves making an initial assumption or hypothesis about the population parameter and then using sample data to test the validity of that assumption.
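As a concrete sketch (all numbers hypothetical), a one-sample two-sided z-test follows the standard recipe: state H0, compute a test statistic from the sample, and compare the p-value to the significance level:

```python
import math

# Hypothetical scenario: H0 says the population mean is 100 (sigma known).
n, sample_mean, mu0, sigma, alpha = 50, 103.0, 100.0, 10.0, 0.05

# Test statistic: how many standard errors the sample mean is from mu0.
z = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Two-sided p-value from the standard normal CDF (via the error function).
p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

reject_h0 = p_value < alpha   # here: reject H0 at the 5% level
```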


  1. What is Type 1 & Type 2 error? ⭐

Ans: Type 1 error, also known as a false positive, occurs when the null hypothesis is rejected even though it is actually true. The probability of making a Type 1 error is the level of significance (alpha) chosen before the hypothesis test; a common choice is 0.05, meaning there is a 5% chance of making a Type 1 error.

Type 2 error, also known as a false negative, occurs when the null hypothesis is not rejected even though it is actually false. Its probability is denoted beta, and 1 - beta is the statistical power of the test.


  1. Explain some statistical tests' use cases (e.g., two-tailed test, t-test, ANOVA test, Chi-squared test). ⭐

Ans: The use cases of the tests are as follows,


  1. What do high and low p-values mean? ⭐

Ans: In hypothesis testing, the p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming that the null hypothesis is true. A low p-value (typically below the chosen significance level) is evidence against the null hypothesis, while a high p-value means the observed data is consistent with it.


  1. What is the significance of KL Divergence in Machine Learning?.⭐⭐

Ans: KL divergence (also known as Kullback-Leibler divergence) is a measure of the difference between two probability distributions. In machine learning, KL divergence is used to measure the similarity or dissimilarity between two probability distributions, usually between the estimated distribution and the true distribution.


  1. How could you deal with data skewness? What are the approaches to resolve skewness in the data? ⭐

Ans: Skewness is a measure of the asymmetry of a probability distribution. Common remedies include log, square-root, or Box-Cox transformations, and winsorizing or removing extreme values.


  1. What is IQR? How is it used to detect outliers? ⭐

Ans: IQR stands for interquartile range, a measure of the spread of a dataset defined as the difference between the 75th percentile (Q3) and the 25th percentile (Q1): IQR = Q3 - Q1. A common rule flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers.
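A minimal standard-library sketch of the IQR rule (the sample data is made up):

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]   # 100 is an obvious outlier

# Quartiles; "inclusive" interpolates between data points.
q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Standard 1.5 * IQR fences.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```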


  1. Which algorithms are sensitive to outliers, and which are robust to them? ⭐

Ans: Some algorithms are highly sensitive to outliers, while several others are considered robust:


📙 Back to Top Section

  1. Why Feature Scaling is important? What are the feature scaling techniques?.⭐

Ans: Feature scaling is an important step in the preprocessing of data for machine learning algorithms because it helps to standardize the range of independent variables or features of the data.


  1. If we don’t remove the highly correlated features in the dataset, how does it impact the model performance? ⭐⭐

Ans: If you don’t remove highly correlated features from your dataset, it can have a negative impact on the performance of your model.

Removing highly correlated variables before training can improve the model’s performance by eliminating multicollinearity and reducing the complexity of the model.


  1. What is Spearman Correlation? What do you mean by positive and negative correlation? ⭐

Ans: Spearman correlation is a measure of the statistical dependence between two variables, also known as the Spearman rank correlation coefficient. It is a non-parametric measure of correlation, meaning it does not assume that the underlying distribution of the data is normal; instead, it calculates the correlation between the ranks of the data points.


  1. What is the difference between Covariance & Correlation? ⭐⭐

Ans: Covariance is a measure of the degree to which two random variables change together. It can be positive, negative, or zero: a positive covariance means the variables increase or decrease together, a negative covariance means that as one variable increases the other decreases, and a zero covariance means there is no linear relationship between them. The formula for covariance is:

Cov(X, Y) = (1/n) * Σ(x - x̄) * (y - ȳ)

where X and Y are the two random variables, x̄ and ȳ are their means, and n is the number of observations. Correlation is the covariance normalized by the product of the two standard deviations, which makes it unitless and bounded between -1 and 1, so it is easier to compare across variable pairs.
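Both quantities can be sketched straight from their definitions (population versions, toy data):

```python
import statistics

def covariance(xs, ys):
    # mean of the products of deviations from the means (population version)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # covariance rescaled by the standard deviations -> always in [-1, 1]
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x: positive covariance, correlation exactly 1
```

Rescaling y changes the covariance but leaves the correlation at 1, which is the practical difference between the two.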


  1. What is the difference between Multiclass Classification Models & Multilabel Classification Models?.⭐

Ans: In multiclass classification, the goal is to classify instances into one of several predefined classes. For example, classifying images of animals into dog, cat, and horse classes. Each instance can only belong to one class, and the classes are mutually exclusive.

Multilabel classification, on the other hand, is a problem where each instance can belong to multiple classes simultaneously. For example, classifying news articles into multiple topics, such as sports, politics, and technology. In this case, an article can belong to the sports and politics class simultaneously.


  1. Can you explain the concept of ensemble learning?.⭐

Ans: Ensemble learning is a technique in machine learning where multiple models are combined to create a more powerful model. The idea behind ensemble learning is to combine the predictions of multiple models to create a final prediction that is more accurate and robust than the predictions of any individual model.


  1. What are the different ensemble modeling strategies in ML? ⭐

Ans: There are several ensemble learning techniques, including:


  1. How do you select features for a machine-learning model?.⭐

Ans: There are several feature selection algorithms used in machine learning, including:


  1. What are the dimensionality Reduction techniques in Machine Learning?.⭐⭐

Ans: There are several dimensionality reduction techniques in machine learning, including:


  1. Explain the PCA steps in machine learning. ⭐⭐

Ans:Principal Component Analysis (PCA) is a widely used dimensionality reduction technique in machine learning. It is a linear technique that transforms the original dataset into a new set of uncorrelated variables called principal components. The goal of PCA is to find the directions (principal components) that capture the most variation in the data. The following are the steps for performing PCA:
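The steps above can be sketched in NumPy; the dataset is synthetic, and the third feature is a deliberate linear copy of the first so that only two components carry variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2.0 * X[:, 0]                 # redundant feature

# 1. Center (and usually scale) the data.
Xc = X - X.mean(axis=0)
# 2. Covariance matrix of the centered data.
cov = np.cov(Xc, rowvar=False)
# 3. Eigen-decomposition: eigenvectors are the principal components.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the top-k components.
k = 2
X_reduced = Xc @ eigvecs[:, :k]
explained_ratio = eigvals / eigvals.sum()
```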

  1. What is cross-validation? What are the different types of cross-validation techniques? ⭐⭐

Ans: Cross-validation is a technique used to evaluate the performance of a machine learning model by dividing the data into training and testing sets and evaluating the model on the testing set. There are several types of cross-validation techniques that can be used, including:


  1. What are the trade-offs between the different types of classification algorithms? How do you choose the best one? ⭐

Ans: Different types of classification algorithms have different strengths and weaknesses. The choice of algorithm depends on the specific characteristics of the data and the goal of the analysis. Here are some trade-offs between some common classification algorithms:


  1. How would you use the Naive Bayes classifier for categorical features? What if some features are numerical?.⭐⭐⭐

Ans:Naive Bayes is a probabilistic classifier that is based on the Bayes theorem, which states that the probability of a hypothesis (in this case, a class label) given some observations (in this case, feature values) is proportional to the probability of the observations given the hypothesis multiplied by the prior probability of the hypothesis.


  1. How does ROC curve and AUC value help measure how good a model is?.⭐⭐

Ans: The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) value are commonly used to evaluate the performance of binary classification models.

  1. How does the size of the training set affect the choice of classifier? ⭐⭐

Ans: The size of the training set can be a factor to consider when choosing a classifier, but it is not the only one. Here are some general guidelines on how the size of the training set can affect the choice of classifier:


📙 Back to Top Section

Regression Techniques ( Concepts )

  1. Can you explain the difference between simple linear regression and multiple linear regression? How do you decide which one to use for a given problem?.⭐⭐

Ans:Simple linear regression is a type of linear regression where the target variable is predicted using a single predictor variable. The relationship between the predictor variable and the target variable is represented by a straight line. The equation for a simple linear regression model is:

y = b0 + b1*x

Where y is the target variable, x is the predictor variable, b0 is the y-intercept, and b1 is the slope of the line.

Multiple linear regression, on the other hand, is a type of linear regression where the target variable is predicted using multiple predictor variables. The relationship between the predictor variables and the target variable is represented by a hyperplane. The equation for a multiple linear regression model is:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

Where y is the target variable, x1, x2, …, xn are the predictor variables, b0 is the y-intercept, and b1, b2, …, bn are the coefficients of the predictor variables.
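A minimal sketch, fitting a multiple linear regression by least squares on synthetic, noise-free data generated from known coefficients (b0 = 1, b1 = 2, b2 = -3):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]   # ground-truth coefficients

# Prepend a column of ones so the intercept b0 is estimated too.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef
```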

  1. What are the assumptions of linear regression and how do you check if they hold for a given dataset?.⭐

Ans:The assumptions of linear regression are:

  1. What are the techniques to impute categorical variables? ⭐

Ans: Categorical imputations are techniques that can be performed before fitting the model:

  1. What is the difference between Lasso & Ridge regression?.⭐

Ans: Both Lasso and Ridge regression are types of linear regression, but they have different approaches to solving the problem of overfitting, which occurs when a model is too complex and captures the noise in the data as well as the underlying relationship.

(figure: Ridge vs Lasso regularization)
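A sketch of ridge's closed-form solution, (XᵀX + λI)⁻¹Xᵀy, on synthetic data with hypothetical λ values, showing how a larger penalty shrinks the coefficients toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([5.0, -3.0, 2.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    # closed-form ridge solution: (X'X + lam*I)^-1 X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_small = ridge(X, y, lam=0.01)     # close to ordinary least squares
w_large = ridge(X, y, lam=1000.0)   # heavily shrunk toward zero
```

Lasso has no closed form (its L1 penalty is not differentiable at zero), which is why it is usually solved by coordinate descent and can zero coefficients out entirely.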


  1. How do we select the right regularization parameters?.⭐

Ans: Regularization is a technique used to prevent overfitting by adding a penalty term to the cost function of a model. The regularization term is controlled by a parameter, called the regularization parameter or hyperparameter. The value of this parameter determines the strength of the regularization.


  1. Why is Logistic Regression called a Linear Model?.⭐

Ans: Logistic Regression is called a linear model because the relationship between the input features and the output is linear in the model parameters. The model is represented by a linear equation of the form:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where x1, x2, ..., xn are the input features and b0, b1, b2, ..., bn are the model coefficients, each representing the contribution of its feature. However, it's important to note that logistic regression is used for classification: the output is the probability that the input belongs to a certain class, obtained by passing this linear combination through the logistic (sigmoid) function, which is non-linear.

  1. Can Logistic Regression be used on an imbalanced dataset? ⭐⭐

Ans: Logistic regression can be used on an imbalanced dataset, but it is important to be aware of the limitations and potential issues that can arise.

Additionally, it’s important to use evaluation metrics other than accuracy, such as precision, recall, F1-score, and AUC.


  1. Can you explain the concept of polynomial regression? How does it differ from linear regression?.⭐⭐

Ans: Polynomial regression is a type of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial. A polynomial regression model can be represented by an equation of the form:

y = b0 + b1*x + b2*x^2 + b3*x^3 + ... + bn*x^n

where y is the output variable, x is the input variable, and b0, b1, b2, …, bn are the model coefficients. The coefficients represent the contribution of each degree of x to the output variable.
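A quick sketch with NumPy's `polyfit`, recovering the coefficients of a known quadratic y = 1 + 2x + 3x²:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 50)
y = 1.0 + 2.0 * x + 3.0 * x ** 2     # known degree-2 polynomial

coeffs = np.polyfit(x, y, deg=2)      # highest-degree coefficient first
b2, b1, b0 = coeffs
```

Note the model is still linear in the coefficients; only the features (powers of x) are non-linear, which is why ordinary least squares can fit it.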


  1. How do you deal with outliers in linear regression? What are the most common techniques for identifying and dealing with outliers?.⭐⭐

Ans: Outliers in linear regression can have a significant impact on the model parameters and can lead to poor model performance. Here are some common techniques for identifying and dealing with outliers:


  1. What are the most common evaluation metrics in Regression Problems?.⭐

Ans: There are several common evaluation metrics used in regression problems, some of the most popular include:


  1. How is R-squared different from Adjusted R²? What is the main difference? ⭐

Ans: R-squared (R²) and adjusted R-squared (adjusted R²) are both measures of the goodness of fit of a regression model, but they are slightly different.


  1. Can you explain the concept of Elastic Net regression? How does it differ from Ridge and Lasso regression?.⭐⭐

Ans: Elastic Net is a linear regression model that combines the properties of both Ridge and Lasso regression. Like Ridge, it adds an L2 regularization term to the cost function to prevent overfitting; like Lasso, it adds an L1 regularization term to perform feature selection. The trade-off between the L1 and L2 terms is controlled by a mixing parameter, alpha, ranging from 0 to 1: when alpha = 0 the model reduces to Ridge regression, and when alpha = 1 it reduces to Lasso regression.


  1. In a sparse dataset where most of the values are 0, which supervised classification algorithm we should use?.⭐⭐

Ans: When dealing with a sparse dataset where most values are zero, a suitable choice is the Naive Bayes classifier, in particular the Multinomial Naive Bayes variant. It relies on counting occurrences of features, so it handles large numbers of zero values well and can still learn useful information from the features that do occur. It uses multinomial distributions for the likelihood estimates of the feature counts, which makes it robust to sparse data.
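A minimal, from-scratch Multinomial Naive Bayes sketch on a made-up word-count corpus (add-one smoothing; the labels and documents are invented for illustration):

```python
import math
from collections import Counter

# Toy, made-up word-count data (the kind of sparse features NB handles well).
docs = [("spam", "win money now".split()),
        ("spam", "win win prize".split()),
        ("ham",  "meeting at noon".split()),
        ("ham",  "lunch meeting tomorrow".split())]

vocab = {w for _, words in docs for w in words}
class_counts = Counter(label for label, _ in docs)
word_counts = {c: Counter() for c in class_counts}
for label, words in docs:
    word_counts[label].update(words)

def log_posterior(words, c):
    # log prior + log likelihoods with Laplace (add-one) smoothing
    total = sum(word_counts[c].values())
    score = math.log(class_counts[c] / sum(class_counts.values()))
    for w in words:
        score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score

def predict(words):
    return max(class_counts, key=lambda c: log_posterior(words, c))
```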


  1. What are the regularization techniques other than L1 and L2 in machine learning? ⭐

Ans: Elastic Net regularization: this is a combination of L1 and L2 regularization, where a linear combination of the L1 and L2 penalties is used in the cost function. It helps to balance the trade-off between sparsity and shrinkage.


  1. Which evaluation metrics are sensitive to outliers?

Ans: Evaluation metrics that are sensitive to outliers include mean squared error and, to a lesser extent, mean absolute error: they aggregate the individual differences between predicted and actual values, so a single extreme error can dominate them. The median absolute error, on the other hand, is resistant to outliers, since the median ignores extreme individual errors.
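A small sketch (made-up numbers) showing one wild prediction dominating MSE, pulling up MAE, and leaving the median absolute error untouched:

```python
import statistics

# Identical predictions except one wild miss.
y_true = [10, 12, 11, 13, 12]
y_pred = [10, 12, 11, 13, 112]     # last prediction is off by 100

errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

mse   = statistics.fmean(e ** 2 for e in errors)   # dominated by the outlier
mae   = statistics.fmean(errors)                   # pulled up, less severely
medae = statistics.median(errors)                  # unaffected: still 0
```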


  1. What is the difference between bagging and boosting?.⭐

Ans:Bagging and Boosting are two ensemble methods that are used to improve the performance of machine learning models.



  1. How does a Decision Tree Regressor work? ⭐

Ans:A Decision Tree Regressor is a supervised learning algorithm that is used to predict a continuous target variable based on several input features. The algorithm works by recursively splitting the data into subsets based on the values of the input features. Each split is made in such a way as to maximize the reduction in impurity of the target variable.


  1. Why don’t we use Mean Squared Error as a cost function in Logistic Regression? ⭐

Ans: MSE is a cost function for linear regression and is not suitable for logistic regression, because it is not a good measure of the difference between predicted probabilities (between 0 and 1) and true class labels, and combined with the sigmoid it produces a non-convex optimization problem. The log loss (cross-entropy) is the more appropriate cost function for logistic regression because it is convex and heavily penalizes predictions that are confident but incorrect.


  1. How can we avoid Over-fitting in Logistic Regression models?

Ans:Regularization, Pruning, Cross-validation, Early stopping, and Ensemble methods are some of the techniques that can be used to avoid overfitting in logistic regression models.


  1. Explain Significance of Log Odds in Logistic regression with example.

Ans:In logistic regression, the log odds (also known as the logit) is a measure of the relationship between the independent variables and the dependent binary variable. The log odds is the natural logarithm (ln) of the odds of the dependent variable being a 1 (success) versus a 0 (failure).

For example, let’s say we are trying to predict the likelihood of a customer purchasing a product based on their age and income. The log odds for a customer with an age of 30 and an income of $50,000 purchasing the product would be calculated as:

ln(P(purchase=1)/(1-P(purchase=1))) = ln(odds of purchasing)

Where P(purchase=1) is the probability of purchasing the product.

The significance of the log odds in logistic regression is that it allows us to model the probability of the binary outcome as a linear function of the independent variables. The log odds is a linear function of the independent variables in logistic regression, so it allows us to make predictions about the probability of the binary outcome (e.g., purchase or no purchase) based on the values of the independent variables (e.g., age and income).
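The example above can be sketched with hypothetical fitted coefficients (b0, b_age, b_income are invented; income is in thousands of dollars):

```python
import math

# Hypothetical fitted coefficients for age and income (in $1000s).
b0, b_age, b_income = -5.0, 0.08, 0.03

def log_odds(age, income):
    # the logit is a linear function of the features
    return b0 + b_age * age + b_income * income

def probability(age, income):
    # the sigmoid inverts the logit, giving P(purchase = 1)
    return 1.0 / (1.0 + math.exp(-log_odds(age, income)))

lo = log_odds(30, 50)      # the example customer: age 30, income $50k
p = probability(30, 50)
```

Applying ln(p / (1 - p)) to the predicted probability recovers exactly the linear log-odds, which is the relationship the answer describes.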


  1. Between Linear regression and Random forest regression which model will perform better in Airbnb House price prediction and why?

Ans:It depends on the specific characteristics of the data and the problem at hand. Both linear regression and random forest regression can be used to predict house prices on Airbnb, but they have different strengths and weaknesses.


📙 Back to Top Section

Classification Techniques ( Concepts )

Support Vector Machine (SVM)

  1. When would you use SVM vs Logistic regression?.⭐⭐⭐

Ans:Support Vector Machine (SVM) and Logistic Regression are both supervised learning algorithms that can be used for classification tasks. However, there are some key differences between the two that may influence which one you choose to use for a given problem.


  1. Why would you use the Kernel Trick? ⭐⭐⭐

Ans: In classification problems, the goal is to establish a decision boundary that maximizes the margin between the classes. In the real world, however, this becomes difficult when we have to deal with non-linearly separable data. One approach is to transform the data: map all the data points to a higher dimension, find the boundary there, and make the classification. That sounds reasonable, but as the number of dimensions grows, computations within that space become more and more expensive. In such cases, the kernel trick allows us to operate in the original feature space without computing the coordinates of the data in the higher-dimensional space, and therefore offers a more efficient and less expensive way to obtain the benefit of transforming data into higher dimensions. There exist different kernel functions, such as:


  1. What is the Hinge Loss in SVM? ⭐⭐⭐

Ans: In Support Vector Machines (SVMs), the hinge loss is a commonly used loss function for training the model to classify data points correctly. For a single example, it is the maximum of 0 and one minus the product of the true label and the raw model output. It is represented by the following equation:

hinge loss = max(0, 1 - y*(w·x + b))

where y is the true class label (-1 or 1), w·x + b is the raw decision value, and w and b are the model parameters.
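A direct sketch of the formula for a single example:

```python
def hinge_loss(y, fx):
    """Hinge loss for one example: y in {-1, +1}, fx = w.x + b."""
    return max(0.0, 1.0 - y * fx)

beyond_margin = hinge_loss(+1, 2.5)   # confident & correct -> 0.0
inside_margin = hinge_loss(+1, 0.4)   # correct but within the margin -> 0.6
wrong_side    = hinge_loss(+1, -1.0)  # misclassified -> 2.0
```

The loss is zero only for points correctly classified beyond the margin, which is what pushes the SVM toward a large-margin boundary.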


  1. Can you explain the concept of the kernel trick and how it relates to SVM?.⭐⭐⭐

Ans: The kernel trick is a technique used in Support Vector Machines (SVMs) to transform the input data into a higher-dimensional space, where a linear decision boundary can be found. This allows SVMs to model non-linear decision boundaries even though the optimization problem is solved in a linear space.

  1. What is a hard-margin SVM? What do margin and support vectors mean? ⭐⭐⭐

Ans: Support Vector Machines (SVMs) are a powerful algorithm for binary classification problems. The standard SVM algorithm is known as the hard-margin SVM, which aims to find the maximum-margin hyperplane: a decision boundary that separates the two classes with the greatest possible margin. The margin is defined as the distance between the decision boundary and the closest data points from each class, known as support vectors.

  1. What are the components of a Support Vector Machine? ⭐⭐⭐

Ans: Support Vector Machine (SVM) is a supervised learning algorithm that can be used for classification and regression tasks. The SVM algorithm has several components, which include:


  1. How does the choice of kernel function affect the performance of SVM classifier?.⭐⭐⭐

Ans: The choice of kernel function in Support Vector Machine (SVM) classifier has a significant impact on the performance of the model. The kernel function maps the input data into a higher-dimensional space, where a linear boundary can be found. Different kernel functions have different properties that make them more suitable for different types of data and tasks.


Some Other variations of the Above Questions.

  1. How does SVM handle the case of non-linearly separable data?⭐⭐⭐
  2. How does the SVM algorithm handle multi-class classification problems?⭐⭐⭐
  3. How does one interpret the support vectors in an SVM classifier? ⭐⭐⭐
  4. How does the concept of margin maximization in SVM classifier relate to model interpretability?⭐⭐⭐
  5. How does the concept of kernel trick relate to the curse of dimensionality in SVM classifier?⭐⭐⭐

📙 Back to Top Section

Decision Tree Concepts

  1. How does a decision tree classifier work?.⭐

Ans:A decision tree classifier is a type of algorithm used for both classification and regression tasks. The basic idea behind a decision tree is to divide the feature space into smaller regions, called “leaves”, by recursively partitioning the data based on the values of the input features.


  1. What are the most common techniques for pruning a Decision Tree?.⭐⭐

Ans: Decision tree pruning is a technique used to reduce the complexity of a decision tree and prevent overfitting. Here are some common techniques for pruning a decision tree:


  1. How do you choose the optimal depth of a decision Tree?.⭐

Ans:Choosing the optimal depth of a decision tree is an important task in order to prevent overfitting and underfitting. There are several techniques to choose the optimal depth of a decision tree, here are some of the most common ones:


  1. What are Entropy & Gini Impurity? What are the differences between them, and which one is faster to compute? ⭐

Ans: Entropy and Gini impurity are two measures used to evaluate the quality of a split in a decision tree. They quantify the amount of disorder or randomness in a set of data.
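Both impurity measures can be sketched in a few lines from class proportions; note that Gini needs no logarithm, which is why it is typically faster to compute:

```python
import math

def entropy(ps):
    # Shannon entropy in bits; 0 for a pure node, maximal for a 50/50 split.
    return -sum(p * math.log2(p) for p in ps if p > 0)

def gini(ps):
    # Gini impurity: 1 - sum(p^2); no logarithm needed.
    return 1.0 - sum(p * p for p in ps)

pure, mixed = [1.0, 0.0], [0.5, 0.5]
```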


  1. What is the fundamental concept of ID3, CART, CHAID, C4.5 Algorithm?.⭐⭐⭐

Ans:ID3, CART, CHAID, and C4.5 are all decision tree algorithms used for classification and regression tasks. They are all based on the same fundamental concept of recursively partitioning the feature space into smaller regions, called “leaves”, by selecting the best feature to split the data at each node. However, they differ in the way they select the best feature and the stopping criteria for the tree-building process.

  1. What is Information Gain in decision trees? ⭐⭐

Ans: Information gain is a measure used in decision tree algorithms to evaluate the quality of a split in the data. It measures the reduction in impurity or uncertainty in a set of examples after a split is made based on a specific feature. It is calculated as the difference between the entropy of the original set of examples and the weighted average of the entropy of the subsets of examples created by the split.


  1. What are the main hyperparameters in Decision Tree.⭐⭐

Ans:There are several hyperparameters that can be adjusted in decision tree algorithms, here are some of the main ones:


  1. How do we handle categorical variables in decision trees? ⭐⭐⭐

Ans: By using different encoding strategies, such as one-hot, label, or target encoding.


  1. What is the difference between the OOB score and the validation score?

Ans:OOB (Out-of-Bag) score and validation score are two different ways to evaluate the performance of a decision tree algorithm.


  1. What are the most important hyperparameters in the XGBoost algorithm? ⭐⭐⭐

Ans:XGBoost (eXtreme Gradient Boosting) is an ensemble learning algorithm that is commonly used for classification and regression tasks. The algorithm has several hyperparameters that can be adjusted to optimize the performance of the model. Here are some of the most important hyperparameters in XGBoost:


  1. How does the decision tree algorithm handle imbalanced datasets and what are the techniques to tackle it?.⭐⭐⭐

Ans: Decision tree algorithms can handle imbalanced datasets by modifying the criteria used for splitting the data. With imbalanced datasets, a majority class can easily dominate the decision tree and make it less sensitive to the minority class. There are several techniques that can be used to tackle imbalanced datasets in decision tree:

Some Other variations of the Above Questions.

  1. How does the concept of random forests relate to decision tree and how does it improve performance?⭐⭐⭐
  2. Can you discuss the use of decision tree for regression problems and the differences with classification tasks?⭐⭐⭐
  3. Can you discuss the interpretability of decision tree models and how it is related to the depth of the tree? ⭐⭐⭐

📙 Back to Top Section

Boosting Algorithms (GBM, LightGBM, CatBoost)

  1. What are the two key principles of LightGBM? ⭐

Ans: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. Its two key principles are:


  1. How does the Gradient Boosting algorithm differ from the Random Forest algorithm? ⭐

Ans:Gradient Boosting and Random Forest are both ensemble learning techniques that combine multiple weak models to create a strong model. However, there are several key differences between the two algorithms:


  1. How does the learning rate parameter in the Gradient Boosting algorithm affect the performance of the model? ⭐

Ans: The learning rate parameter controls the step size at which the algorithm moves in the direction of the negative gradient of the loss function. A smaller learning rate results in a more accurate model but takes longer to train, while a larger learning rate trains faster but tends to produce a less accurate model. The best value is often found via grid search, or by using a learning rate schedule in which the rate is decreased over time as the algorithm approaches convergence.

  1. What are most important Hyperparameters in GBM.⭐

Ans:Gradient Boosting Machine (GBM) has several important hyperparameters that can affect the performance of the model. The most important hyperparameters in GBM are:

  1. min_samples_split
    • Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
  2. min_samples_leaf
    • Defines the minimum samples (or observations) required in a terminal node or leaf.
    • Generally lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in majority will be very small.
  3. max_depth
    • The maximum depth of a tree.
    • Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
    • Should be tuned using CV.
  4. learning_rate (boosting hyperparameter)
    • Controls the contribution of each tree to the final prediction; lower values generally require more trees.
  5. n_estimators
    • The number of sequential trees to be modeled; tuned together with learning_rate.
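
A minimal tuning sketch for these hyperparameters, assuming scikit-learn's `GradientBoostingClassifier` on a small synthetic dataset (the grid values are illustrative, not recommendations):

```python
# Cross-validated grid search over the GBM hyperparameters listed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```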

  1. How does the concept of Gradient Boosting relate to boosting theory, and how does it improve the performance of weak learners?.⭐

Ans: Gradient Boosting is a machine learning technique based on the concept of boosting, which is a method for combining multiple weak learners to create a strong learner. The key idea behind boosting is to iteratively train weak models and combine their predictions in a way that improves the overall performance of the model.
In Gradient Boosting, the weak learners used are decision trees and the algorithm trains them in a sequence, where each tree tries to correct the mistakes of the previous tree.


  1. What are the key principles of CatBoost classifiers?.⭐

Ans: CatBoost is a gradient-boosting framework that is specifically designed to handle categorical variables. The key principles of CatBoost are:


📙 Back to Top Section

Naive Bayes Classifier

  1. How does the assumption of independence between features affect the performance of Naive Bayes classifier?.⭐

Ans: The assumption of independence between features in the Naive Bayes classifier affects the performance of the classifier in two main ways:


  1. Why is the Naive Bayes Classifier called “Naive”?.⭐

Ans: The Naive Bayes Classifier is called “naive” because it assumes that the features are independent of each other, which is often unrealistic, but it allows the classifier to make predictions based on the individual probabilities of each feature rather than considering the relationships between the features. This simplifies the calculations and makes the classifier computationally efficient, but it can reduce accuracy when the features are not truly independent.


  1. How does the choice of prior probability distribution affect the performance of Naive Bayes classifier?

Ans: The choice of prior probability distribution in the Naive Bayes classifier can affect the performance of the model in several ways:


  1. How does the Naive Bayes classifier perform in the presence of irrelevant features?

Ans: Since the Naive Bayes classifier only considers the individual feature probabilities and not the relationships between features, irrelevant features do not affect its predictions. Irrelevant features do not change the prior probability of the class or the probability of the features given the class, and therefore do not affect the probability of the class given the features.


  1. How does the concept of Laplace smoothing in the Naive Bayes classifier improve the model’s performance?

Ans: Laplace smoothing, also known as additive or add-one smoothing, is a technique used in the Naive Bayes classifier to improve the model’s performance. The main idea behind Laplace smoothing is to avoid zero probabilities, which can cause the model to make incorrect predictions.
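
A minimal sketch of add-one smoothing for a single word-likelihood estimate (the counts and vocabulary size below are made up for illustration):

```python
# Add-one (Laplace) smoothing for P(word | class) in a Naive Bayes text
# classifier: an unseen word gets a small non-zero probability instead of
# zero, which would otherwise wipe out the whole product of probabilities.
def word_prob(word, counts, vocab_size, alpha=1):
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

counts = {"pizza": 3, "cheese": 1}                    # counts in one class
p_seen = word_prob("pizza", counts, vocab_size=5)     # (3+1)/(4+5) = 4/9
p_unseen = word_prob("burger", counts, vocab_size=5)  # (0+1)/(4+5) = 1/9
```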


📙 Back to Top Section

Stats & Probability Fundamentals

  1. In any 15-minute interval, there is a 20% chance that you see at least one shooting star. What is the probability that you see at least one shooting star in the next 60 minutes?

Ans: Use the complement rule: the probability of an event happening is 1 minus the probability of it not happening. The four 15-minute intervals are independent, and the chance of seeing no shooting star in one interval is 0.8, so P(at least one in 60 minutes) = 1 - (0.8)^4 = 1 - 0.4096 = 0.5904.
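
The complement-rule calculation in code:

```python
# P(at least one in 60 min) = 1 - P(none in any of the four independent
# 15-minute intervals) = 1 - (1 - 0.2)^4.
p_none_15 = 1 - 0.2
p_at_least_one_60 = 1 - p_none_15 ** 4   # ~0.5904
```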


  1. Find the probability of getting 53 Sundays in a non-leap year.

Ans: A non-leap year has 365 days: 52 full weeks (364 days) plus one day left over. That extra day is equally likely to be any one of the 7 days of the week, and the year contains 53 Sundays exactly when the extra day is a Sunday. Therefore the probability is 1/7.
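
Counting the favourable cases in code:

```python
# A 365-day year is 52 full weeks plus one extra day; the year has 53
# Sundays exactly when that extra day is a Sunday. The extra day is
# equally likely to be any of the 7 weekdays.
extra_days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
favourable = sum(1 for d in extra_days if d == "Sun")
p_53_sundays = favourable / len(extra_days)   # 1/7
```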

  1. Types of Selection bias.

Ans:


  1. How many types of probability distributions are there with examples?.

Ans: There are many types of probability distributions, and the specific types and examples can vary depending on the field or context in which they are being used. Some common types of probability distributions include:


  1. What are the types of sampling?.

Ans: There are several types of sampling techniques, including:


  1. Difference between sample size and margin of error.

Ans: Sample size and margin of error are two important concepts in statistical sampling.


📙 Back to Top Section

Deep Learning Fundamentals

CNN Fundamentals (Cost Function, Backpropagation)

  1. How do you handle overfitting in deep learning models? Can you discuss various techniques such as early stopping, dropout, and regularization, and when it would be appropriate to use each?.⭐

Ans: Overfitting in deep learning models occurs when a model is trained too well on the training data and, as a result, performs poorly on unseen data. This happens because the model has learned the noise in the training data rather than the underlying pattern. Common remedies include early stopping (halt training once the validation loss stops improving), dropout (randomly zero out activations during training so the network cannot rely on any single unit), and L1/L2 regularization (penalize large weights). Dropout and regularization are most useful when the network has more capacity than the data supports, while early stopping is a cheap default for almost any training run.
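
A framework-agnostic sketch of the early-stopping rule (the validation losses below are made up for illustration):

```python
# Early stopping: stop training when the validation loss has not improved
# for `patience` consecutive epochs, and report the best epoch seen.
def train_with_early_stopping(val_losses, patience=2):
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation loss stopped improving: overfitting begins
    return best_epoch

# Validation loss improves, then rises as the model starts to overfit:
stop_at = train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.7, 0.8])
```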


  1. How do Neural Networks get the optimal Weights and Bias values?.⭐

Ans:


  1. What is the difference between loss and cost function in Deep Learning?.⭐

Ans:


  1. What are the roles of an Activation Function?.⭐

Ans:

  1. What’s the difference between Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), and in which cases would you use each one?.⭐

Ans: Convolutional Neural Networks apply a convolution to the data before using it in fully connected layers. This makes them well suited for data with spatial structure, such as images, and they are widely used for tasks like image classification, object detection, and segmentation.

RNNs, on the other hand, are used for tasks involving sequential data, such as natural language processing and speech recognition. They are designed to handle data with temporal dependencies by using recurrent layers, which allow information to be passed from one time step to the next. RNNs can also be used in tasks such as language modeling, machine translation, and image captioning.


  1. What are the steps in a CNN, and what is the significance of each step?.⭐

Ans: The steps in a Convolutional Neural Network (CNN) typically include the following:


  1. Types of optimization algorithms in CNN and their pros and cons.⭐

Ans: There are several types of optimization algorithms that can be used to train a convolutional neural network (CNN), each with its own pros and cons. Some commonly used optimization algorithms include:

  1. Stochastic Gradient Descent (SGD): This is a simple and widely used optimization algorithm that updates the weights by taking the gradient of the loss function with respect to the weights. One of the main advantages of SGD is that it is computationally efficient and easy to implement. However, it can be sensitive to the learning rate and may get stuck in local minima.

  2. Momentum: This is an extension of SGD that incorporates a momentum term to help the algorithm overcome local minima and converge faster. Momentum helps the optimization to continue in the direction of the previous steps, reducing the chances of getting stuck in a local minimum.

  3. Nesterov Accelerated Gradient (NAG): This is another variation of SGD that uses the momentum term to estimate the next position of the weights and then corrects it. This method is less sensitive to the learning rate and can reach the optimal solution faster than SGD and Momentum.

  4. Adagrad: This algorithm adapts the learning rate for each parameter individually, which can lead to faster convergence for some problems. However, since it uses a different learning rate for each parameter, it can lead to some parameters being updated more frequently than others, which can cause issues in some cases.

  5. Adadelta: This algorithm is an extension of Adagrad that tries to overcome its issue by using an average of the historical gradient updates to determine the learning rate for each parameter.

  6. Adam: Adaptive Moment Estimation (Adam) is an optimization algorithm that combines the advantages of Adagrad and Momentum. It keeps moving averages of the gradients and of the squared gradients (running estimates of their first and second moments); the term adaptive in the name refers to the fact that the algorithm adapts the learning rate of each parameter based on this historical gradient information.

  7. RMSprop: This algorithm is similar to Adadelta and Adam, it also uses moving averages of the gradient updates to determine the learning rate for each parameter, but it uses the root mean square of the gradients instead of the sum of the gradients.

  8. Nadam: This algorithm is an extension of Adam and Nesterov Momentum.
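
A toy comparison of the plain SGD and momentum update rules on f(x) = x², whose gradient is 2x (this illustrates the update equations only; it is not a performance benchmark):

```python
# Plain SGD steps directly against the gradient; momentum accumulates an
# exponentially decayed sum of past gradients and steps along it.
def sgd(lr=0.1, steps=100, x0=10.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x            # step against the gradient of x^2
    return x

def sgd_momentum(lr=0.1, beta=0.9, steps=100, x0=10.0):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + 2 * x       # decayed accumulation of gradients
        x -= lr * v                # step along the accumulated direction
    return x

x_sgd = sgd()
x_mom = sgd_momentum()             # both end up near the minimum at 0
```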

  1. Types of activation functions and their output ranges in CNN.⭐

Ans: Activation functions are used in the neurons of a convolutional neural network (CNN) to introduce non-linearity into the network. There are several types of activation functions, each with its own properties and output range, some commonly used activation functions include:

  1. Sigmoid: This function maps any input value to a value between 0 and 1. It’s mostly used in the output layer of binary classification problems.

  2. ReLU (Rectified Linear Unit): This function maps any input value less than 0 to 0 and any input value greater than 0 to the same value. It’s widely used in the hidden layers of the network.

  3. Tanh (Hyperbolic Tangent): This function maps any input value to a value between -1 and 1. It’s similar to the sigmoid function but outputs values in a wider range.

  4. Leaky ReLU: This function is similar to ReLU, but instead of mapping negative input values to 0, it maps them to a small negative value (e.g. 0.01). This helps to avoid the “dying ReLU” problem, where a neuron can become “dead” and output zero for all input values.

  5. ELU (Exponential Linear Unit): This function is similar to Leaky ReLU, but the negative part of the function has a slope of alpha, where alpha is a hyperparameter. This helps to avoid the “dying ReLU” problem and also helps to speed up the convergence.

  6. Softmax: This function is used in the output layer of multi-class classification problems. It maps the input values to a probability distribution over n classes, where n is the number of classes.

sigmoid
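
The functions above written out in plain Python, with their output ranges noted (tanh is available directly as `math.tanh`):

```python
import math

def sigmoid(x):             # output in (0, 1)
    return 1 / (1 + math.exp(-x))

def relu(x):                # output in [0, inf)
    return max(0.0, x)

def leaky_relu(x, a=0.01):  # small negative slope instead of a hard zero
    return x if x > 0 else a * x

def softmax(xs):            # outputs form a probability distribution
    exps = [math.exp(x - max(xs)) for x in xs]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]
```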


  1. What is the dying ReLU problem?.⭐

Ans: The “dying ReLU” problem is a phenomenon that can occur in convolutional neural networks (CNNs) when using the rectified linear unit (ReLU) activation function. The ReLU activation function maps any input value less than 0 to 0, which can cause some neurons to “die” and output zero for all input values. This happens because if the weights in the network are not properly initialized or if the learning rate is too high, it is possible for the weights to update in such a way that the input to the ReLU activation function becomes negative for all training examples.


  1. What are the different cost functions for binary and multiclass classification problems?.⭐

Ans: For binary classification, the standard cost function is binary cross-entropy (log loss); hinge loss is an alternative used by margin-based classifiers. For multiclass classification, the standard choice is categorical cross-entropy over softmax outputs, or sparse categorical cross-entropy when the labels are integer-encoded.


  1. What is the Vanishing & Exploding Gradient problem in CNN?.⭐

Ans: The vanishing and exploding gradient problem is a common issue in deep neural networks, including convolutional neural networks (CNNs). During backpropagation, gradients are multiplied layer by layer; when these factors are consistently smaller than 1 the gradient shrinks toward zero (vanishing) and the early layers stop learning, and when they are consistently larger than 1 the gradient grows uncontrollably (exploding) and training becomes unstable.
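
A back-of-the-envelope illustration: the sigmoid derivative is at most 0.25, and backpropagation multiplies roughly one such factor per layer, so the gradient shrinks geometrically with depth; factors above 1 grow just as fast:

```python
# Vanishing: 20 sigmoid layers multiply at most 0.25 per layer.
max_sigmoid_grad = 0.25
grad_20_layers = max_sigmoid_grad ** 20   # ~9e-13: effectively zero

# Exploding: per-layer factors slightly above 1 blow up instead.
grad_explode = 1.5 ** 20                  # ~3325: gradient blows up
```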

  1. How do you approach transfer learning in deep learning, and when would you use a pre-trained model versus training from scratch?.⭐

Ans: Transfer learning is a technique in deep learning that allows a model trained on one task to be used as the starting point for a new task. This is done by taking the pre-trained weights of a model trained on a large dataset and fine-tuning the model on the new task using a smaller dataset. A pre-trained model is preferred when the new dataset is small or similar to the original training data; training from scratch makes more sense when ample data is available or the domain differs substantially from anything the pre-trained model has seen.

  1. What are the weight initialization techniques in Deep Learning?.⭐

Ans: Weight initialization is an important step in training deep neural networks, as it can have a significant impact on the network’s performance and ability to converge. Commonly used techniques include zero or constant initialization (rarely used, because it fails to break symmetry between neurons), random uniform or Gaussian initialization, Xavier/Glorot initialization (variance scaled by fan-in and fan-out, suited to tanh and sigmoid activations), and He initialization (variance scaled by fan-in, suited to ReLU activations).
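
A sketch of He initialization, one common scheme for ReLU layers (the layer sizes below are illustrative):

```python
import math
import random

# He initialization: weights drawn from a zero-mean Gaussian with
# standard deviation sqrt(2 / fan_in), which keeps activation variance
# roughly constant through ReLU layers.
def he_init(fan_in, fan_out, rng=random):
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

random.seed(0)
W = he_init(fan_in=256, fan_out=128)   # one weight matrix, 256 x 128
```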

📙 Back to Top Section

Deep Learning - NLP

  1. Explain every block of Attention Networks and their working principles.⭐

Ans: Attention networks are a type of neural network architecture that allows the model to focus on specific parts of the input while processing it. They are commonly used in natural language processing tasks such as machine translation and text summarization. Here are the main components of attention networks and their working principles:

In summary, Attention Networks are a neural network architecture that allows the model to focus on specific parts of the input while processing it. They use an encoder to process the input, an attention mechanism to determine which parts of the input are most important for the task at hand, and a decoder to generate the output. Attention weights are used to indicate how important each input is for the task, and Multi-head Attention is used to attend to multiple parts of the input simultaneously.
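
A single-head, unbatched sketch of scaled dot-product attention, the core computation behind the attention weights described above (the tiny Q/K/V matrices are made up for illustration):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # shift for stability
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:                                   # one query at a time
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]                     # similarity to every key
        weights = softmax(scores)                 # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])   # weighted sum of values
    return out

Q = [[1.0, 0.0]]                   # one query vector
K = [[1.0, 0.0], [0.0, 1.0]]       # two keys
V = [[1.0, 2.0], [3.0, 4.0]]       # two values
out = attention(Q, K, V)           # one output row, blending both values
```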


  1. Explain Attention with a real-time example.⭐

Ans: Take a sentence as input to the attention network: “I want to order pizza with extra cheese.”

Attention Blog

  1. Explain every block of transformers and its working, taking an input sequence string as an example.⭐

Ans: Transformers are a type of neural network architecture that are commonly used in natural language processing tasks such as machine translation and text summarization. They were introduced in the paper “Attention Is All You Need” by Google researchers in 2017. Here are the main components of transformers and their working principles:

transformers

Transformer Blog

  1. What is the need of the Self-Attention Layer in the transformer?.⭐

Ans: The self-attention layer in the transformer is used to determine which parts of the input are most important for the task at hand. It allows the model to focus on specific parts of the input while processing it, rather than processing the entire input in a sequential manner. This is particularly useful in natural language processing tasks, where the meaning of a sentence can depend on relationships between words that are far apart in the sentence.

  1. Explain every component of the BERT transformer in detail with a suitable example.⭐

Ans: BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model developed by Google that is trained on a large corpus of text data. Here are the main components of BERT and their working principles:

  1. What are the evaluation metrics in question-answering system using Transformers?.⭐

Ans: There are several evaluation metrics that can be used to evaluate the performance of a question-answering system using Transformers, some of the most common ones are:
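
A sketch of the two most commonly reported metrics, SQuAD-style Exact Match and token-level F1 (real evaluation scripts additionally normalize articles and punctuation; the example strings are made up):

```python
# Exact Match: 1 if the predicted answer string equals the gold answer
# after simple normalization, else 0. Token F1: harmonic mean of token
# precision and recall between prediction and gold answer.
def exact_match(pred, gold):
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("extra cheese", "Extra cheese")            # 1
f1 = token_f1("order pizza with extra cheese", "extra cheese")
```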

  1. How is RoBERTa different from BERT transformer?.⭐

Ans: RoBERTa (Robustly Optimized BERT Pretraining Approach) is a variation of the BERT transformer that was developed by researchers at Facebook AI. RoBERTa is similar to BERT in that it is a transformer-based language model that is trained on a large corpus of text data, but there are several key differences between the two models:

  1. How is the confidence score calculated in a RoBERTa transformer-based Question Answering System?.⭐

Ans: In a RoBERTa-based question answering system, confidence scores are calculated to indicate how confident the model is in its predicted answer. The exact method of calculating the confidence score can vary depending on the specific implementation of the model, but some common methods include:

  1. Difference between Transformer models and RNN/LSTM.⭐

Ans: RNNs and LSTMs face several issues that make them less effective than Transformer models for long sequences. The primary challenges are vanishing gradients and limited context windows, which restrict their ability to capture long-range dependencies; their sequential processing also prevents parallelization across time steps, whereas Transformers relate all positions of a sequence simultaneously through self-attention.