Master Panda is at His Best

Our post ‘Pandas Methods to Analyze your Data’ is based on the Data Analysis with Python online course at cognitiveclass.ai.

In this post we will learn various Pandas methods to analyze your data. We hope you have gone through the post Importing and Exporting Data (CSV format) in Python - Pandas way; if you have not, please do, as this post is the next in the series.

In this post we will use the famous Iris data set (Author: R.A. Fisher, “UCI Machine Learning Repository: Iris Data Set”).

First we will import the data into our Jupyter Notebook:-

import pandas as pd

path = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

iris = pd.read_csv(path, header=None)

If you read the data description at the source, you will find:

Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class: 
      -- Iris Setosa
      -- Iris Versicolour
      -- Iris Virginica

We will add the column headers to the imported data.

headers = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class1"]

iris.columns = headers
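The two steps above (read without a header, then assign the column names) can also be folded into a single call. The sketch below is only an illustration: it assumes a small in-memory sample holding the first three rows of the Iris data in place of the UCI URL, so that it runs offline, and the name iris_sample is made up for the example. The names argument is a standard read_csv parameter.

```python
import io
import pandas as pd

# In-memory stand-in for the iris.data file (first three rows of the dataset).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

headers = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class1"]

# names= supplies the column labels; header=None says the file has no header row.
iris_sample = pd.read_csv(sample, header=None, names=headers)
print(iris_sample)
```

Either approach gives the same labelled dataframe; assigning iris.columns afterwards is handy when the names are decided later.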

iris.head(6) will yield:-

sepal_length sepal_width petal_length petal_width class1
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa

and iris.tail(6) will result in:-

sepal_length sepal_width petal_length petal_width class1
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
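As a quick illustration of how the row count argument behaves, here is a minimal sketch on a made-up ten-row frame (the frame df and its value column are hypothetical, not part of the Iris data). Called without an argument, head() and tail() return 5 rows.

```python
import pandas as pd

# Hypothetical ten-row frame used only to illustrate head() and tail().
df = pd.DataFrame({"value": range(10)})

first_rows = df.head()    # no argument: the first 5 rows
last_rows = df.tail(3)    # the last 3 rows, original index labels kept

print(first_rows)
print(last_rows)
```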

 

Basic Insight of Dataset

After reading the data into a Pandas dataframe, it is time to explore the dataset a little. There are several ways to obtain essential insights that help us understand the dataset better.

Data Types in Pandas

Data comes in a variety of types. The main types stored in Pandas objects are object, float, int, bool and datetime64.

Some Important Points:-

  1. The main types stored in Pandas objects are object, float, int, and datetime.
  2. The datatype names are somewhat different from those in native Python.
  3. Some are very similar, such as the numeric datatypes “int” and “float”.
  4. The “object” Pandas type functions similarly to “string” in Python, save for the change in name, while the “datetime” Pandas type is very useful for handling time series data.
  5. It is worth checking the data types in a dataset. One reason is that Pandas automatically assigns types based on the encoding it detects from the original data table, and that assignment may not be what you expect.

To learn more about each attribute, it is always good to know the data type of each column. In Pandas:

dataframe.dtypes

returns a Series with the data type of each column.

iris.dtypes
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class1           object
dtype: object
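Checking dtypes matters most when the automatic type assignment goes wrong. The sketch below is a hedged illustration, not something from the Iris data: it assumes a hypothetical frame in which a stray "?" entry caused a numeric column to be read as text, and uses pd.to_numeric to repair it.

```python
import pandas as pd

# Hypothetical frame: "width" was read as text because of a stray "?" entry.
df = pd.DataFrame({"length": [5.1, 4.9, 4.7], "width": ["3.5", "?", "3.2"]})
print(df.dtypes)    # width is not a float column here

# errors="coerce" turns unparseable entries into NaN instead of raising.
df["width"] = pd.to_numeric(df["width"], errors="coerce")
print(df.dtypes)    # width is now float64
```

Coercing to NaN is one possible choice; depending on the data you may prefer to inspect or replace the bad entries first.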

Describe

If we would like to check the statistical summary of each column, such as the record count, the column mean, the column standard deviation, etc., we can use:

dataframe.describe()

It generates various summary statistics, excluding NaN (Not a Number) values.

iris.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
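The Iris data has no missing values, so the "excluding NaN" behaviour is not visible in the table above. A minimal sketch on a hypothetical column x with one missing entry shows it:

```python
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"x": [1.0, 2.0, float("nan"), 4.0]})
summary = df.describe()

# count reports non-missing entries only; mean ignores the NaN as well.
print(summary.loc["count", "x"])    # number of non-missing values
print(summary.loc["mean", "x"])     # mean of 1, 2 and 4
print(summary.loc["50%", "x"] == df["x"].median())
```

The "50%" row is the median, so it matches the result of calling median() on the column directly.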

To check all the columns, including those of other types (such as object), you can add the argument include="all" inside the parentheses.

iris.describe(include="all")

sepal_length sepal_width petal_length petal_width class1
count 150.000000 150.000000 150.000000 150.000000 150
unique NaN NaN NaN NaN 3
top NaN NaN NaN NaN Iris-virginica
freq NaN NaN NaN NaN 50
mean 5.843333 3.054000 3.758667 1.198667 NaN
std 0.828066 0.433594 1.764420 0.763161 NaN
min 4.300000 2.000000 1.000000 0.100000 NaN
25% 5.100000 2.800000 1.600000 0.300000 NaN
50% 5.800000 3.000000 4.350000 1.300000 NaN
75% 6.400000 3.300000 5.100000 1.800000 NaN
max 7.900000 4.400000 6.900000 2.500000 NaN

Now it provides the statistical summary of all the columns, including object-typed attributes. For object-type columns, a different set of statistics is evaluated: unique, top and freq.

“Unique” is the number of distinct objects in the column, “top” is the most frequently occurring object, and “freq” is the number of times the top object appears in the column.

Some values in the table above show as “NaN” because that statistic is not available for the particular column type.
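The unique, top and freq statistics can be reproduced by hand, which may help in seeing what they mean. This is a minimal sketch on a made-up label column (the Series labels is hypothetical), using nunique() and value_counts():

```python
import pandas as pd

# Hypothetical label column; "a" is the clear most frequent value.
labels = pd.Series(["a", "b", "a", "c", "a", "b"], name="label")

counts = labels.value_counts()    # sorted by frequency, highest first
print(labels.nunique())           # corresponds to "unique"
print(counts.index[0])            # corresponds to "top"
print(counts.iloc[0])             # corresponds to "freq"

summary = labels.describe()
print(summary)
```

In the Iris data all three classes appear 50 times, so the reported top value there is simply one of the tied classes.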

For a particular column, use iris[['sepal_length']].describe():
sepal_length
count 150.000000
mean 5.843333
std 0.828066
min 4.300000
25% 5.100000
50% 5.800000
75% 6.400000
max 7.900000
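The double brackets matter here: a list of labels selects a dataframe, while a single label selects a Series. The sketch below illustrates the difference on a small hypothetical frame with three Iris-like rows; the numbers in the summary are the same either way, only the container differs.

```python
import pandas as pd

# Small hypothetical frame with two numeric columns.
df = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7], "sepal_width": [3.5, 3.0, 3.2]})

as_frame = df[["sepal_length"]].describe()    # list of labels -> DataFrame
as_series = df["sepal_length"].describe()     # single label -> Series

print(type(as_frame).__name__, type(as_series).__name__)
```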

Info Method

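Another method you can use to check your dataset is:

dataframe.info()

It provides a concise summary of your DataFrame: the index range, the name, non-null count and data type of each column, and the memory usage. Note that the parentheses are required; typing dataframe.info without them in a notebook merely displays the bound method along with a printout of the frame's rows.

iris.info()

Check for yourself.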
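Note that info is a method and has to be called with parentheses to produce its summary. As a rough sketch of what it reports, the example below runs info() on a tiny hypothetical frame and captures the text in a buffer (the buf parameter is part of DataFrame.info) so it can be inspected:

```python
import io
import pandas as pd

# Tiny hypothetical frame with one numeric and one text column.
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "class1": ["Iris-setosa", "Iris-setosa"]})

buffer = io.StringIO()
df.info(buf=buffer)    # write the summary to the buffer instead of stdout
report = buffer.getvalue()
print(report)
```

The report lists the index range, each column with its non-null count and dtype, and the memory usage.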

For advanced reading:-

Practical data analysis with Python

This guide is a comprehensive introduction to the data analysis process using the Python data ecosystem and an interesting open dataset. There are four sections covering selected topics as follows: