Web Scraping Using lxml

In this article I will demonstrate a few examples of web scraping. According to the Wikipedia article, Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. There are a lots of tutorial on web scraping. In this post I will demonstrate web scraping while solving a few problems. I will use Python3 and a few libraries for this purpose. Lets get into the problem.
Continue reading “Web Scraping Using lxml”

Data Science Interview Questions

In this post I am going to make a compilation of interview questions for data science role. A big part of them are questions that I faced during my interviews. I have also gathered questions from different websites and which I found interesting. So, lets get started.

What do you know about bias-variance/bias-variance tradeoff?

In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set [Wikipedia]:

  • The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Bias are the simplifying assumptions made by a model to make the target function easier to learn. Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines. Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression [2].
  • The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting). Variance is the amount that the estimate of the target function will change if different training data was used. Low variance suggests small changes to the estimate of the target function with changes to the training dataset. High variance suggests large changes to the estimate of the target function with changes to the training dataset. Generally, nonparametric machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, that is even higher if the trees are not pruned before use. Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression. Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines [2].

Continue reading “Data Science Interview Questions”