Data visualization

"Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics, and even animations to translate the data/information into a visual context."

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import files
uploaded = files.upload()

import io
my_data = pd.read_csv(io.BytesIO(uploaded['Iris.csv']))
Saving Iris.csv to Iris.csv

Iris dataset

The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each, along with four measurements for each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

sepal_length, sepal_width, petal_length, petal_width, & class_label

class_label = ['Iris_setosa', 'Iris_virginica', 'Iris_versicolor']

In [2]:
my_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class_label   150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
In [3]:
my_data.describe()
Out[3]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Line graph

A line chart shows how one variable (on the y-axis) changes with another (on the x-axis) by connecting the data points with line segments. The cells below draw a few line charts in Python with seaborn:

In [4]:
sns.lineplot(data=my_data, x="class_label", y="sepal_length")
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07481fff10>
In [5]:
my_data2 = my_data.query("sepal_length > 5")
sns.lineplot(data=my_data2, x="class_label", y="sepal_length")
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07480ff280>
In [6]:
# Line graph: sepal length against petal length, one line per species
sns.lineplot(data=my_data, x="petal_length", y="sepal_length", hue="class_label", legend="auto")
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0747bc7bb0>

Bar plot

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.

In [7]:
category_order = ['Iris_setosa', 
                  'Iris_virginica', 
                  'Iris_versicolor']

# Count plot: shows the number of observations in each category
sns.catplot(x='class_label',
            data=my_data,
            kind='count',
            order=category_order)

plt.show()
In [8]:
# Bar plot: looks similar to a count plot, but instead of the number of
# observations in each category it shows the mean of a quantitative
# variable among the observations in that category
sns.barplot(data=my_data, x="class_label", y="sepal_length")

plt.show()
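
The difference between a count plot and a bar plot can be sketched in plain Python. The rows below are a hypothetical miniature sample, not the real Iris measurements, and the snippet only reproduces the numbers behind the bars (a per-category count and a per-category mean), not the plots themselves.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical miniature sample: (class_label, sepal_length) pairs.
# These values are illustrative and are NOT taken from Iris.csv.
rows = [
    ("Iris_setosa", 5.0),
    ("Iris_setosa", 5.2),
    ("Iris_versicolor", 6.0),
    ("Iris_versicolor", 6.4),
    ("Iris_virginica", 6.5),
    ("Iris_virginica", 7.1),
    ("Iris_virginica", 6.8),
]

# Count plot: bar height = number of rows in each category.
counts = Counter(label for label, _ in rows)

# Bar plot: bar height = mean of the numeric column in each category.
grouped = defaultdict(list)
for label, sepal_length in rows:
    grouped[label].append(sepal_length)
means = {label: mean(values) for label, values in grouped.items()}

print(dict(counts))
print(means)
```

On the notebook's DataFrame, the same numbers could be obtained with a group-by on class_label.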

Pie chart

A pie chart is a circular statistical plot that displays a single data series. The whole circle represents 100% of the data, and each slice (called a wedge) represents one part's share of the total. The area of a wedge is proportional to the length of its arc, and therefore to that part's percentage of the whole. Pie charts are commonly used in business presentations (sales, operations, survey results, resources, etc.) because they provide a quick summary.

In [9]:
# Pie chart

# Define the data: the dataset has 50 samples of each species
data = [50, 50, 50]
class_label = ['Iris_setosa',
               'Iris_virginica',
               'Iris_versicolor']

# Define the Seaborn color palette to use
colors = sns.color_palette('pastel')[0:3]

# Create the pie chart; autopct prints each wedge's percentage
plt.pie(data, labels=class_label, colors=colors, autopct='%.0f%%')
plt.show()
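
As a rough sketch of how wedge sizes follow from the data, the snippet below derives each species' share and wedge angle from label counts instead of typing the numbers by hand. It assumes 50 samples per species, as stated in the dataset description; the variable names are illustrative.

```python
from collections import Counter

# Assumption: 50 samples per species, as in the dataset description.
labels = (["Iris_setosa"] * 50
          + ["Iris_virginica"] * 50
          + ["Iris_versicolor"] * 50)

counts = Counter(labels)
total = sum(counts.values())

# Each wedge's share of the whole, and the angle it spans in degrees.
shares = {name: n / total for name, n in counts.items()}
angles = {name: 360 * share for name, share in shares.items()}

for name in counts:
    print(f"{name}: {shares[name]:.1%} of the pie, {angles[name]:.0f} degrees")
```

Because only the ratios matter, any three equal counts produce the same chart: three wedges of about 33% each.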

Donut chart

A donut chart (or doughnut chart) is a special kind of pie chart with a blank circle at its center. The whole ring represents the data series under consideration, and each piece of the ring represents that part's proportion of the series, or its percentage of the total if the whole ring represents 100% of the data. The chart takes its name from the donut, which has a hole at its center.

In [10]:
# Donut chart: a pie chart with a white circle drawn over its center

# Define the data: the dataset has 50 samples of each species
data = [50, 50, 50]
class_label = ['Iris_setosa',
               'Iris_virginica',
               'Iris_versicolor']

# Change the label text color (this rcParams setting also affects later plots)
plt.rcParams['text.color'] = 'red'

# Draw the pie once, with data labels and explicit wedge colors
plt.pie(data, labels=class_label, colors=['red', 'green', 'blue'])

# Add a white circle at the center to turn the pie into a donut
my_circle = plt.Circle((0, 0), 0.7, color='white')
plt.gca().add_artist(my_circle)

# Show the graph
plt.show()

Scatter plot

A scatter plot is a plot or mathematical diagram that uses Cartesian coordinates to display the values of (typically) two variables for a set of data. Each observation in the data set is represented by a dot. It is used to understand the relationship between the two variables.

In [11]:
my_data.plot(kind ="scatter",
          x ='sepal_length',
          y ='petal_length')
plt.grid()
In [12]:
sns.set_style("whitegrid")
 
# FacetGrid draws one scatter layer per species: hue colors the
# points by class_label, height sets the figure height in inches,
# and map() plots sepal_length (x) against petal_length (y).
sns.FacetGrid(my_data, hue ="class_label",
              height = 6).map(plt.scatter,
                              'sepal_length',
                              'petal_length').add_legend()
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x7f0747be8a00>

Pair plot

A pair plot, also known as a scatterplot matrix, shows every variable in the dataset plotted against every other variable. Each off-diagonal panel is an ordinary scatter plot, in which the value of one variable in a data row is matched with the value of another variable in the same row; the panels on the diagonal show the distribution of a single variable.

In [13]:
###PAIR PLOT
#sns.pairplot(data=my_data,kind='scatter')
sns.pairplot(my_data,hue='class_label')
Out[13]:
<seaborn.axisgrid.PairGrid at 0x7f0745281550>
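
To make the layout of the grid concrete, the sketch below enumerates the panels of a pair plot for the four numeric columns listed in the dataset description. It only counts panels and draws nothing.

```python
from itertools import product

# Column names as listed in the dataset description.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One panel per ordered (row variable, column variable) pair.
panels = list(product(features, repeat=2))

# Diagonal panels pair a variable with itself (a single-variable
# distribution); all other panels are two-variable scatter plots.
diagonal = [(row, col) for row, col in panels if row == col]
scatter = [(row, col) for row, col in panels if row != col]

print(len(panels), len(diagonal), len(scatter))  # 16 4 12
```

The panel at (row, col) and the panel at (col, row) show the same pair of variables with the axes swapped, so only six of the twelve scatter panels carry distinct information.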

Boxplot or Whisker plot

The box plot was first introduced in 1969 by the mathematician John Tukey. A box plot gives a statistical summary of the feature being plotted. The top edge of the box is the third quartile, the line inside the box is the median, and the bottom edge of the box is the first quartile. The height of the box is called the interquartile range (IQR). The lines (whiskers) above and below the box extend to the largest and smallest values that are not outliers; by default, seaborn treats values more than 1.5 times the IQR beyond the box as outliers and draws them as individual points.

In [14]:
### BOX PLOT
#sns.boxplot(x=my_data["petal_length"])

sns.boxplot(x='class_label',y='petal_length',data=my_data)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0742c306d0>
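
The quantities a box plot draws can be computed by hand. The sketch below uses a small hypothetical list of petal lengths (not the real Iris values) and assumes the common 1.5 × IQR rule for flagging outliers; quartiles are computed with linear interpolation via the standard library.

```python
from statistics import quantiles

# Hypothetical petal lengths with one unusually large value.
# These numbers are illustrative and are NOT taken from Iris.csv.
values = [1.0, 1.3, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 4.5]

# Quartiles with linear interpolation between data points.
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inliers = [v for v in values if lower_fence <= v <= upper_fence]
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers end at the most extreme values that are not outliers.
whisker_low, whisker_high = min(inliers), max(inliers)

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the upper whisker stops at 1.9 rather than at the overall maximum of 4.5, which is drawn as a separate outlier point.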

Heatmap

A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colours. The Seaborn package allows the creation of annotated heatmaps which can be tweaked using Matplotlib tools as per the creator's requirement.

In [15]:
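# Keep only the four numeric feature columns (drop class_label)
df = my_data.iloc[:, 0:4]

plt.figure(figsize=(7, 4))
# Plot the pairwise correlation matrix, annotating each cell with its value
sns.heatmap(df.corr(), annot=True, cmap='summer')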
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0741391730>
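
The matrix passed to the heatmap holds pairwise Pearson correlation coefficients (the default for DataFrame.corr()). As a minimal sketch of what each cell contains, the snippet below computes the same kind of matrix by hand for three hypothetical columns; the numbers are illustrative, not the real Iris data, and the helper name pearson is an assumption made for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical measurements (NOT taken from Iris.csv).
columns = {
    "petal_length": [1.4, 1.5, 4.5, 4.7, 6.0],
    "petal_width": [0.2, 0.3, 1.5, 1.4, 2.5],
    "sepal_width": [3.5, 3.0, 3.2, 2.8, 3.3],
}

# Nested dict playing the role of the correlation matrix.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns}

for name, row in corr.items():
    print(name, [round(value, 2) for value in row.values()])
```

The diagonal is always 1 (each variable correlates perfectly with itself) and the matrix is symmetric, which is why the heatmap is mirrored across its diagonal.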

Histogram

A histogram is a graph showing a frequency distribution: the range of values is divided into intervals (bins), and each bar shows the number of observations that fall within that interval.

In [16]:
sns.histplot(data=my_data, x="sepal_length")
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07412c5850>
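
The counting behind a histogram can be sketched in a few lines. The values and the choice of four equal-width bins below are assumptions made for illustration; seaborn chooses the number of bins automatically unless told otherwise.

```python
# Hypothetical sepal lengths (NOT taken from Iris.csv).
values = [4.3, 4.9, 5.0, 5.1, 5.8, 5.8, 6.4, 6.9, 7.9]

n_bins = 4  # assumed bin count for this sketch
low, high = min(values), max(values)
width = (high - low) / n_bins

# Count how many values fall into each equal-width bin.
counts = [0] * n_bins
for value in values:
    # The maximum value belongs to the last bin, so clamp the index.
    index = min(int((value - low) / width), n_bins - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    left = low + i * width
    print(f"[{left:.1f}, {left + width:.1f}): {count}")
```

Each printed count corresponds to the height of one bar.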

Density plot

A density plot is a variation of the histogram that uses kernel smoothing when plotting the values. The result is a continuous, smooth version of the histogram estimated from the data.

In [17]:
sns.kdeplot(data=my_data, x="sepal_length")
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07412a4d00>
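
Kernel smoothing can be illustrated with a hand-written Gaussian kernel density estimate: every sample contributes a small bell curve, and the density at a point is the average of those curves. The sample values and the fixed bandwidth of 0.4 are assumptions for this sketch; seaborn selects a bandwidth automatically.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point."""
    norm = 1.0 / (len(samples) * bandwidth * sqrt(2 * pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

# Hypothetical sepal lengths (NOT taken from Iris.csv).
samples = [4.3, 4.9, 5.0, 5.1, 5.8, 5.8, 6.4, 6.9, 7.9]
density = gaussian_kde(samples, bandwidth=0.4)  # assumed bandwidth

# Evaluate the smooth curve on a coarse grid.
for x in [4.0, 5.0, 6.0, 7.0, 8.0]:
    print(f"density({x}) = {density(x):.3f}")
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows individual samples more closely; in both cases the area under the curve is 1.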