Data visualization

"Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics, and even animations to translate the data/information into a visual context."

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import files
uploaded = files.upload()

import io
my_data = pd.read_csv(io.BytesIO(uploaded['Titanic.csv']))
Saving Titanic.csv to Titanic.csv
In [2]:
my_data = pd.read_csv('Titanic.csv')

Titanic dataset

Below are the features provided in the dataset.

PassengerId: an ID given to each traveler on the boat
Pclass: the passenger class, with three possible values: 1, 2, 3 (first, second and third class)
Name: the name of the passenger
Sex
Age
SibSp: number of siblings and spouses traveling with the passenger
Parch: number of parents and children traveling with the passenger
Ticket: the ticket number
Fare: the ticket fare
Cabin: the cabin number
Embarked: the port of embarkation, with three possible values: S, C, Q

class_label = {1: 'Survived', 0: 'Not Survived'}
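
The Survived column stores the numbers 0 and 1. As a minimal sketch, assuming 1 means the passenger survived and 0 means they did not, the codes can be turned into readable labels with `Series.map`. The values below are toy data, not rows from Titanic.csv.

```python
import pandas as pd

# Toy values standing in for the Survived column (not taken from Titanic.csv)
survived_codes = pd.Series([0, 1, 1, 0], name="Survived")

# Assumed coding: 1 = survived, 0 = did not survive
class_label = {1: "Survived", 0: "Not Survived"}

# Replace each numeric code with its label
survived_text = survived_codes.map(class_label)
print(survived_text.tolist())
```

On the real data the same call would be `my_data["Survived"].map(class_label)`.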

In [3]:
my_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [4]:
print(my_data.head())
my_data.describe()
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
Out[4]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Finding Unique values

In [5]:
my_data["Embarked"].unique() # ports of embarkation: Southampton (S), Cherbourg (C), Queenstown (Q)
Out[5]:
array(['S', 'C', 'Q', nan], dtype=object)
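
`unique()` lists the distinct values but not how often each one occurs. A small sketch, using a made-up column in place of `my_data["Embarked"]`, shows how `value_counts(dropna=False)` tallies every value and keeps the missing entries as their own row.

```python
import numpy as np
import pandas as pd

# Toy column standing in for my_data["Embarked"] (values are made up)
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

# dropna=False keeps NaN as its own row in the tally
counts = embarked.value_counts(dropna=False)
print(counts)
```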

HANDLING MISSING VALUES

In [6]:
m1=my_data["Age"].median(skipna=True)
m2=my_data["Age"].mean(skipna=True)
print("Median: {} and Mean: {} | Median age is 28 as compared to mean which is ~30".format(m1,m2))

a=sum(pd.isnull(my_data['Age'])) # count of missing values in "Age"
b=round(a/(len(my_data["PassengerId"])),4) # fraction of rows with "Age" missing

# report the share of missing "Age" values as a percentage
print("Count of missing Values : {} , The Proportion of this values with dataset is {}\n".format(a,b*100))
Median: 28.0 and Mean: 29.69911764705882 | Median age is 28 as compared to mean which is ~30
Count of missing Values : 177 , The Proportion of this values with dataset is 19.869999999999997

In [7]:
a=sum(pd.isnull(my_data['Fare'])) # count of missing values in "Fare"
b=round(a/(len(my_data["PassengerId"])),4) # fraction of rows with "Fare" missing

# report the share of missing "Fare" values as a percentage
print("Count of missing Values : {} , The Proportion of Fare values with dataset is {}\n".format(a,b*100))
Count of missing Values : 0 , The Proportion of Fare values with dataset is 0.0
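
The two cells above check one column at a time. As a sketch under the assumption that a single summary is preferred, the same count and percentage can be computed for every column at once. The frame `df` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isnull() marks missing cells; sum() counts them, mean() gives the fraction
missing = pd.DataFrame({
    "count": df.isnull().sum(),
    "percent": (df.isnull().mean() * 100).round(2),
})
print(missing)
```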

In [8]:
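###HANDLING MISSING VALUES

# Note: train_data is a second name for the same DataFrame, not a copy,
# so the fills below also change my_data.
train_data = my_data
train_data["Age"] = train_data["Age"].fillna(28)             # 28 is the median age found above
train_data["Embarked"] = train_data["Embarked"].fillna("S")
#train_data.drop('Cabin', axis=1, inplace=True)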
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
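
The fill values can also be computed from the data instead of being typed in by hand. The sketch below assumes the median is wanted for a numeric column and the most frequent value for a categorical one; the frame is toy data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

age_median = df["Age"].median()            # NaN is skipped by default
embarked_mode = df["Embarked"].mode()[0]   # mode() returns a Series, take its first value

# Assign the filled columns back instead of filling in place
df["Age"] = df["Age"].fillna(age_median)
df["Embarked"] = df["Embarked"].fillna(embarked_mode)
print(df)
```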

Line graph

A line chart shows how one variable (Y) changes with another (X) by joining successive data points with line segments, so it works best when X has a natural order. Here are some examples of line charts in Python:

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

y2= my_data["Age"]
x2= np.arange(len(y2))
#x2 = my_data["Survived"]



plt.plot(x2, y2)
plt.show()

sns.lineplot(data=my_data, x="Survived", y="Age")
print(y2.mean())
29.36158249158249
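
A line is easiest to read when the x values are sorted and each x has a single y value. The following is only a sketch on made-up numbers: it averages Fare per Age with `groupby` (which sorts the keys) and plots the result with matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; ages and fares are made up
df = pd.DataFrame({
    "Age": [34, 22, 22, 58, 34, 41],
    "Fare": [13.0, 7.25, 8.05, 26.55, 21.0, 7.9],
})

# One mean fare per age; groupby returns the ages in sorted order
mean_fare = df.groupby("Age")["Fare"].mean()

fig, ax = plt.subplots()
ax.plot(mean_fare.index, mean_fare.values, marker="o")
ax.set_xlabel("Age")
ax.set_ylabel("Mean fare")
```

In a notebook the figure is displayed at the end of the cell; in a script, call `plt.show()` afterwards.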
In [10]:
survived_mean = my_data.query("Survived > 0")
print(survived_mean["Age"].mean())

not_survived_mean = my_data.query("Survived == 0")
print(not_survived_mean["Age"].mean())
28.29143274853801
30.028233151183972
In [11]:
import numpy as np

sns.lineplot(x="Embarked",
             y="Age",
             hue="Survived",
             data=my_data);

Bar plot

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.

In [12]:
category_order = ['S', 
                  'C', 
                  'Q']

sns.catplot(x='Embarked',
            data=my_data,
            kind='count', 
            order=category_order)

plt.show()

""" Count plot: It gives you the count of  the instances of variable under each category"""
Out[12]:
' Count plot: It gives you the count of  the instances of variable under each category'
In [13]:
sns.barplot(data=my_data, x="Survived", y="Age")

plt.show() 

""" Bar plots look similar to count plots, but instead of the count of observations in each category, they show the mean of a quantitative variable among observations in each category."""
Out[13]:
' Bar plots look similar to count plots, but instead of the count of observations in each category, they show the mean of a quantitative variable among observations in each category.'
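
The difference between the two plots can be seen in the numbers behind them. As a sketch on toy data, the count plot draws the result of `value_counts`, while the bar plot draws a per-category mean (the mean is the default estimator of `sns.barplot`).

```python
import pandas as pd

# Toy data; values are made up
df = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1],
    "Age": [40.0, 20.0, 30.0, 10.0, 20.0],
})

# Bar heights of a count plot: number of rows per category
counts = df["Survived"].value_counts().sort_index()

# Bar heights of a bar plot: mean of a numeric column per category
means = df.groupby("Survived")["Age"].mean()

print(counts)
print(means)
```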

Age

The younger you are the more likely to survive?

In [14]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

data = [train_data]
for dataset in data:
    mean = dataset["Age"].mean()
    std = dataset["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # draw one random integer age per missing value from [mean - std, mean + std)
    rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)
    # fill the NaN values in the Age column with the random ages
    # (Age was already filled with 28 above, so is_null is 0 here)
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice.astype(int)
In [15]:
survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(16, 8))
women = train_data[train_data['Sex']=='female']
men = train_data[train_data['Sex']=='male']
ax = sns.histplot(women[women['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[0], kde =False, color="green", alpha=0.5)
ax = sns.histplot(women[women['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[0], kde =False, color="red", alpha=0.5)
ax.legend()
ax.set_title('Female')
ax = sns.histplot(men[men['Survived']==1].Age.dropna(), bins=18, label = survived, ax = axes[1], kde = False, color="green", alpha=0.5)
ax = sns.histplot(men[men['Survived']==0].Age.dropna(), bins=40, label = not_survived, ax = axes[1], kde = False, color="red", alpha=0.5)
ax.legend()
_ = ax.set_title('Male');

Influence of Passenger Class on Survival

In [16]:
sns.barplot(x='Pclass', y='Survived', data=train_data);

plt.rc('xtick', labelsize=14) 
plt.rc('ytick', labelsize=14)

plt.figure()
fig = train_data.groupby('Survived')['Pclass'].plot.hist(histtype= 'bar', alpha = 0.8)
plt.legend(('Died','Survived'), fontsize = 12)
plt.xlabel('Pclass', fontsize = 18)
plt.show()

MEN v/s WOMEN SURVIVORS

In [17]:
embarked_mode = train_data['Embarked'].mode()[0]  # mode() returns a Series; take its first value
data = [train_data]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(embarked_mode)

grid = sns.FacetGrid(train_data, row='Embarked', height=4.5, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=[1, 2, 3], hue_order=['male', 'female'])
grid.add_legend();

Embarked Influence on Survival

In [18]:
sns.set(style="darkgrid")
sns.countplot( x='Survived', data=train_data, hue="Embarked", palette="Set1");
In [19]:
data = [train_data]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset.loc[dataset['relatives'] > 0, 'travelled_alone'] = 'No'
    dataset.loc[dataset['relatives'] == 0, 'travelled_alone'] = 'Yes'
axes = sns.catplot(x='relatives', y='Survived', kind='point',
                   data=train_data, aspect = 2.5);

Pie chart

A pie chart is a circular statistical plot that displays a single series of data. The full circle represents 100% of the data, and each slice, called a wedge, represents one category's share of the total. The area of a wedge is proportional to the length of its arc (equivalently, to its central angle). Pie charts are commonly used in business presentations (sales, operations, survey results, resources, etc.) because they provide a quick summary.

In [20]:
#PIE CHART

import matplotlib.pyplot as plt
import seaborn as sns
#define data: number of passengers per port of embarkation
class_label = ["S", "C", "Q"]
data = my_data["Embarked"].value_counts().reindex(class_label)
#define Seaborn color palette to use
colors = sns.color_palette('pastel')[0:3]

#create pie chart
plt.pie(data, labels = class_label, colors = colors, autopct='%.0f%%')
plt.show()
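
The percentage printed on each wedge and the angle it spans follow directly from the counts. The arithmetic is sketched below with hypothetical counts (chosen to sum to 100 so the result is easy to check), not with the Titanic figures.

```python
# Hypothetical counts per category, not taken from the dataset
counts = {"S": 60, "C": 30, "Q": 10}

total = sum(counts.values())

# Share of the whole, in percent (what autopct prints on each wedge)
percent = {label: 100 * n / total for label, n in counts.items()}

# Central angle of each wedge, in degrees (the full circle is 360)
angles = {label: 360 * n / total for label, n in counts.items()}

print(percent)
print(angles)
```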

Donut chart

Donut charts (or doughnut charts) are a special kind of pie chart with a blank circle at the center. The whole ring represents the data series under consideration, and each piece of the ring represents that category's proportion of the total. The chart gets its name from the donut, which has a hole at its center.

In [21]:
# Create a pieplot
#define data: number of passengers per port of embarkation
class_label = ["S", "C", "Q"]
data = my_data["Embarked"].value_counts().reindex(class_label)

# add a circle at the center to transform it in a donut chart
my_circle=plt.Circle( (0,0), 0.7, color='white')

# Draw the pie once, with the data labels in red text
plt.pie(data, labels=class_label, colors=['red','green','blue'],
        textprops={'color': 'red'})
p = plt.gcf()
p.gca().add_artist(my_circle)

# Show the graph
plt.show()
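
Covering the center with a white circle is one way to get the ring. Another option, sketched here with hypothetical counts, is to pass a `width` through matplotlib's `wedgeprops` so that each wedge is drawn as a ring segment in the first place.

```python
import matplotlib.pyplot as plt

# Hypothetical counts and labels, for illustration only
counts = [60, 30, 10]
labels = ["S", "C", "Q"]

fig, ax = plt.subplots()

# width is the radial thickness of each wedge; the pie radius is 1,
# so a width of 0.3 leaves a hole of radius 0.7 in the middle
ax.pie(counts, labels=labels, wedgeprops={"width": 0.3})
ax.set_aspect("equal")
```

With this approach the hole is truly empty, so it keeps the background color of the figure instead of being painted white.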

Scatter Plot

A scatter plot is a plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. Here each value in the data set is represented by a dot. It is used for understanding the relationship between the 2 variables.

In [22]:
my_data.plot(kind ="scatter",
          x ='Age',
          y ='Fare')
plt.grid()
In [23]:
sns.set_style("whitegrid")
 
# Age and Fare are columns of the Titanic data;
# height sets the height of the graph, whereas
# hue colors the points by the Survived class.
sns.FacetGrid(my_data, hue ="Survived",
              height = 6).map(plt.scatter,
                              'Age',
                              'Fare').add_legend()
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x7f2ff09dbe50>

Pair plot

Pair Plot: A "pairs plot", also known as a scatterplot matrix, draws a scatter plot for every pair of numeric variables in the data, so that each variable is plotted against every other one. The diagonal shows the distribution of each variable on its own.

In [24]:
###PAIR PLOT
#sns.pairplot(data=my_data,kind='scatter')
sns.pairplot(my_data,hue='Survived')
Out[24]:
<seaborn.axisgrid.PairGrid at 0x7f2fedb14fa0>

Boxplot or Whisker plot

The box plot was introduced by the mathematician John Tukey around 1970. A box plot gives a statistical summary of the feature being plotted. The top edge of the box is the third quartile, the line inside the box is the median, and the bottom edge is the first quartile. The height of the box is called the interquartile range (IQR). The whiskers extend from the box to the largest and smallest values that lie within 1.5 times the IQR of the box edges; when there are no outliers, these are the maximum and minimum of the feature. The black dots on the plot represent the outlier values in the data.
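
The quantities that a box plot draws can be computed by hand. The sketch below uses made-up ages and assumes the 1.5 times IQR rule for the whiskers, which is the default in matplotlib and seaborn.

```python
import numpy as np

# Toy ages; values are made up
ages = np.array([2, 20, 22, 25, 28, 30, 35, 38, 40, 80], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```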

In [25]:
### BOX PLOT
#sns.boxplot(x=my_data["Age"])

sns.boxplot(x='Survived',y='Age',data=my_data)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2fea7e5b50>

HEATMAP

A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colours. The Seaborn package allows the creation of annotated heatmaps which can be tweaked using Matplotlib tools as per the creator's requirement.

In [26]:
from scipy import stats

a = train_data["Survived"]
b = train_data["Age"]
stats.pointbiserialr(a, b)
Out[26]:
PointbiserialrResult(correlation=-0.0649089381669367, pvalue=0.0527661062109971)
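
A heatmap extends that single coefficient to every pair of numeric columns. The following is a minimal sketch of an annotated correlation heatmap with `sns.heatmap`; the frame `df` holds toy values, and the choice of columns and color map is an assumption made for illustration. For a 0/1 column such as Survived, the Pearson coefficient computed by `corr()` is the same quantity as the point-biserial correlation.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy frame; values are illustrative only
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22, 38, 26, 35, 35, 54],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# Pairwise Pearson correlations between the numeric columns
corr = df.corr()

# annot=True writes each value into its cell; vmin/vmax fix the color scale to [-1, 1]
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix (toy data)")
```

On the real data, the numeric columns would be selected first, for example `train_data[["Survived", "Pclass", "Age", "Fare"]].corr()`.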

HISTOGRAM

A histogram shows the frequency distribution of a numeric variable: the range of values is split into intervals (bins), and each bar shows the number of observations that fall within that interval.
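
The counting behind a histogram can be sketched with `np.histogram`. The fares below are made up, and the four equal bins between 0 and 80 are an arbitrary choice for the illustration.

```python
import numpy as np

# Toy fares; values are made up
fares = np.array([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 53.1, 71.3])

# Split the range 0..80 into 4 equal bins and count the values in each
counts, edges = np.histogram(fares, bins=4, range=(0, 80))

print(edges)   # bin boundaries
print(counts)  # number of observations per bin
```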

In [27]:
sns.histplot(data=my_data, x="Fare")
#sns.kdeplot(data=my_data, x="Fare")
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2fe8e60d90>

Density plot

A density plot is a variation of the histogram that uses kernel smoothing when plotting the values. It is a continuous, smooth version of a histogram estimated from the data.

In [28]:
print("Density Plot of Age for Surviving Population and Deceased Population")
plt.figure(figsize=(15,8))
sns.kdeplot(train_data["Age"][train_data.Survived == 1], color="darkturquoise", fill=True)
sns.kdeplot(train_data["Age"][train_data.Survived == 0], color="lightcoral", fill=True)
plt.legend(['Survived', 'Died'])
plt.title('Density Plot of Age for Surviving Population and Deceased Population')
plt.show()
Density Plot of Age for Surviving Population and Deceased Population
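
Kernel smoothing can be sketched by hand: place one Gaussian bump on each observation and average them. The ages below are toy values and the bandwidth is picked by hand for the illustration, whereas seaborn chooses a bandwidth automatically.

```python
import numpy as np

# Toy ages; values are made up
ages = np.array([22.0, 26.0, 28.0, 35.0, 38.0])
bandwidth = 5.0                    # hand-picked width of each Gaussian bump
grid = np.linspace(0, 60, 601)     # points at which the density is evaluated

# Standardized distance from every grid point to every observation
z = (grid[:, None] - ages[None, :]) / bandwidth

# Average of the Gaussian bumps gives the density estimate
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(ages) * bandwidth * np.sqrt(2 * np.pi))

# A density integrates to about 1 (simple Riemann sum over the grid)
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```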