Getting started with Data Science.

Rohit Prakash
4 min read · Feb 16, 2020

What is Data Science?

At its heart, data science is about exploring data and discovering what it can tell you. For example, Netflix gathers user data and uses insights from it to recommend movies that suit your taste.

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. -Wikipedia

This science revolves around a cycle called the Data Science Process. A typical data science process involves obtaining the data, cleaning it, analyzing it, fitting a model, and interpreting and visualizing the results.

Photo by Markus Spiske on Unsplash

Why Data Science?

The meaning of Data Science changes from one person to the next. If the enthralling work people have done using artificial intelligence makes you want to do things like that too, then stop procrastinating and start learning. From a business point of view, data science adds value by using scientific methods to make better decisions and improve profits: it uses existing data to predict undesirable scenarios so that they can be avoided.

Photo by João Silas on Unsplash

How to Data Science?

There are a plethora of guides on getting started with data science that suggest you take courses or read books, but in my opinion, the best way to learn effectively is to work on a project and learn as you go!

The programming language of choice here is Python. It is wonderfully versatile, easy to work with, and has an abundance of modules to help us in our data science journey.

Let’s start by making a simple gender classification program. Disclaimer: this guide assumes that you already have Python installed on your machine and that you are ready to install dependencies.

Install the following dependencies:

  • Scikit-learn
  • NumPy
  • SciPy

Enter the following into the terminal to install the dependencies.

pip install -U scikit-learn
pip install numpy
pip install scipy
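To confirm that the installation worked, one quick check is to import the three packages and print their versions. This is only a sanity-check sketch; the version numbers you see depend on your machine.

```python
# Import each dependency; an ImportError here means the install did not succeed.
import numpy
import scipy
import sklearn  # scikit-learn is imported under the name "sklearn"

# Collect and print the installed version of each package.
versions = {
    "scikit-learn": sklearn.__version__,
    "NumPy": numpy.__version__,
    "SciPy": scipy.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```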

In this program, we are going to use a model called the decision tree. A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (for example, whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and the leaf nodes represent the class labels.
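To make the flowchart idea concrete, here is a tiny decision tree written by hand as nested if statements. The function name classify and the thresholds (170 and 65) are made up purely for illustration; a real decision tree learns its tests from data instead.

```python
# A hand-written decision tree. Each `if` is an internal node (a test on an
# attribute), each branch is an outcome of that test, and each returned
# string is a leaf (a class label).
# NOTE: the thresholds are hypothetical and chosen only for illustration.
def classify(height, weight):
    if height > 170:           # internal node: test on height
        if weight > 65:        # internal node: test on weight
            return "male"      # leaf
        return "female"        # leaf
    return "female"            # leaf


print(classify(180, 77))
print(classify(160, 55))
```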

Image from Wikipedia

The first step is to import the necessary libraries.

from sklearn import tree

We now initialize the decision tree classifier.

dt = tree.DecisionTreeClassifier()

Let X be a list of inputs, where each input is a list of features, for example [height, weight, waist].

X = [[182, 80, 34], [176, 70, 33], [161, 60, 28], [154, 55, 27], [166, 63, 30], [189, 90, 36], [175, 63, 28], [177, 71, 30], [159, 52, 27], [171, 72, 32], [181, 85, 34]]

Let Y be a list of output tags that classify the inputs as either ‘male’ or ‘female’.

Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'male']

Now we fit the decision tree to our input data X and output tags Y. This is the step that trains the machine learning model of our choice on our data.

How well a model is fitted describes how well it generalizes to data similar to that on which it was trained.

A model that is well-fitted produces more accurate outcomes. A model that is overfitted matches the data too closely, i.e., it predicts well only for the training set and will not produce accurate predictions for other data. A model that is underfitted doesn’t match closely enough, i.e., it is not trained well enough to produce correct predictions even for the training set.

dt = dt.fit(X, Y)
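To see which tests the fitted tree actually learned, one option is scikit-learn's export_text helper, which prints the tree as indented rules. This is a sketch: the feature names passed in are labels taken from the [height, weight, waist] description above, and random_state=0 is an added assumption so that repeated runs build the same tree.

```python
from sklearn import tree

# Same training data as in the article: [height, weight, waist] per person.
X = [[182, 80, 34], [176, 70, 33], [161, 60, 28], [154, 55, 27],
     [166, 63, 30], [189, 90, 36], [175, 63, 28], [177, 71, 30],
     [159, 52, 27], [171, 72, 32], [181, 85, 34]]
Y = ['male', 'male', 'female', 'female', 'male', 'male',
     'female', 'female', 'female', 'male', 'male']

# random_state is fixed only to make the run repeatable (an assumption,
# not part of the original program).
dt = tree.DecisionTreeClassifier(random_state=0)
dt = dt.fit(X, Y)

# Render the learned tests and leaves as plain text.
rules = tree.export_text(dt, feature_names=["height", "weight", "waist"])
print(rules)
```

Each indented line is a test on one feature, and each "class:" line is a leaf holding the predicted tag.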

We can now predict the output tag for a given input. Note that predict expects a list of samples, so a single sample is wrapped in an outer list.

prediction = dt.predict([[180, 77, 32]])

We can check the prediction by printing out to the screen.

print(prediction)
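Because predict works on a list of samples, it can label one person or several in a single call. The sketch below assumes the same training data as above; random_state=0 is an added assumption for repeatability, and the batch reuses two rows from the training set.

```python
from sklearn import tree

X = [[182, 80, 34], [176, 70, 33], [161, 60, 28], [154, 55, 27],
     [166, 63, 30], [189, 90, 36], [175, 63, 28], [177, 71, 30],
     [159, 52, 27], [171, 72, 32], [181, 85, 34]]
Y = ['male', 'male', 'female', 'female', 'male', 'male',
     'female', 'female', 'female', 'male', 'male']

dt = tree.DecisionTreeClassifier(random_state=0).fit(X, Y)

# One sample: still wrapped in an outer list, so the input has one row.
single = dt.predict([[180, 77, 32]])
print(single)

# Several samples at once: one row per sample, one predicted tag per row.
batch = dt.predict([[182, 80, 34], [154, 55, 27]])
print(batch)
```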

You can explore the various other models that scikit-learn offers on its website, refer to the documentation on how to use them, and play around with different models.

The following picture helps in choosing the right model for your program.

Image from Scikit-Learn

Once you have played around with multiple models, you may want to compare them and find out which one is the best. To do that, you can use an accuracy metric provided by the sklearn.metrics module.

from sklearn.metrics import accuracy_score

To find the accuracy of the decision tree model on the data it was trained on, compare the true tags Y with the model's predictions for X:

acc_dt = accuracy_score(Y, dt.predict(X)) * 100
print(f"The accuracy is: {acc_dt}")
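Accuracy measured against the training tags Y can look better than the model really is, which is the overfitting problem described earlier. A common way to check is to hold back part of the data for testing. The sketch below uses scikit-learn's train_test_split; the 30% test size and random_state=0 are assumptions chosen for illustration, and with only 11 samples the test score is a very rough estimate.

```python
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = [[182, 80, 34], [176, 70, 33], [161, 60, 28], [154, 55, 27],
     [166, 63, 30], [189, 90, 36], [175, 63, 28], [177, 71, 30],
     [159, 52, 27], [171, 72, 32], [181, 85, 34]]
Y = ['male', 'male', 'female', 'female', 'male', 'male',
     'female', 'female', 'female', 'male', 'male']

# Hold back roughly 30% of the samples; the model never sees them in training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

dt = tree.DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)

# Accuracy on data the model was trained on vs. data it has never seen.
train_acc = accuracy_score(Y_train, dt.predict(X_train)) * 100
test_acc = accuracy_score(Y_test, dt.predict(X_test)) * 100
print(f"Training accuracy: {train_acc}")
print(f"Test accuracy: {test_acc}")
```

A large gap between the two numbers is a hint that the model is overfitted.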

We have successfully made a gender classification program that trains a decision tree and measures its accuracy. Repeating the same steps with other machine learning models lets you compare their accuracy with each other and print the best one.
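Comparing several models could look like the following sketch. The choice of the two extra models (k-nearest neighbors and Gaussian naive Bayes) and their settings are assumptions made for illustration. Like the walkthrough above, it scores each model on its own training data, so the numbers are optimistic.

```python
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X = [[182, 80, 34], [176, 70, 33], [161, 60, 28], [154, 55, 27],
     [166, 63, 30], [189, 90, 36], [175, 63, 28], [177, 71, 30],
     [159, 52, 27], [171, 72, 32], [181, 85, 34]]
Y = ['male', 'male', 'female', 'female', 'male', 'male',
     'female', 'female', 'female', 'male', 'male']

# Candidate models; every scikit-learn classifier shares fit() and predict().
models = {
    "decision tree": tree.DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "naive Bayes": GaussianNB(),
}

# Train each model and record its accuracy on the training data.
scores = {}
for name, model in models.items():
    model.fit(X, Y)
    scores[name] = accuracy_score(Y, model.predict(X)) * 100
    print(f"{name}: {scores[name]}")

# Pick the model with the highest accuracy.
best = max(scores, key=scores.get)
print(f"Best model: {best}")
```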

There we have it! Our first data science program. This is a stepping stone on our wonderful journey in the data science world.

Check the following GitHub link for the source code.

Scikit-learn chart.
