Hey guys, it's so good to see you all again in another blog. This is the last day (Day 7) of my “My 7-day journey to Data science” blog series. Since we started the series we have covered topics like what Pandas is, how we can use Pandas for data science, and how to work with various files and data types in Python for data science.
In this last blog we are going to look at data cleaning in Python.
INDEX:
— Finding the missing values
— Deleting the values
— Fixing the values
— Finding and deleting duplicated values
Finding the missing values:
Data cleaning is the process of deleting or correcting missing values, incorrect values, misplaced values, and so on. Before we can correct missing values we have to find them. This can be done with a built-in pandas function called “isna”.
pandas.isna(a)
Let's say the argument “a” is the variable name of a dataframe. The code above returns a dataframe of the same shape, but instead of the original values each cell holds “True” or “False”. “True” means that the cell is missing (NaN or None), and “False” means that the cell holds a value.
We can also write it in other ways, like
pandas.isnull(a)
a.isnull()
a.isna()
All of these give the same result. Instead of checking whether values are missing, we can also check whether values are present using the “notna” method, like
pandas.notna(a)
This also returns a dataframe, but it holds “True” where a value is present and “False” where the cell is empty. It is the exact opposite of the previous method.
Just like “isna”, it can be written in various ways, such as
pandas.notnull(a)
a.notnull()
a.notna()
In case we need the number of null values present in the data, we can get it with the help of sum(), which gives the count for each column:
pandas.isnull(a).sum()
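To see these checks side by side, here is a small sketch. The dataframe and its column names are made up for illustration; only isna, notna and sum come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are assumptions for this example.
a = pd.DataFrame({
    "NAME": ["AK", None, "RAJ"],
    "AGE": [21, np.nan, 25],
})

missing_mask = a.isna()     # True where a cell is missing
present_mask = a.notna()    # True where a cell holds a value

# sum() counts the True values, so this is the number of missing cells per column
missing_per_column = a.isna().sum()

# Summing a second time gives one number for the whole dataframe
total_missing = int(a.isna().sum().sum())

print(missing_mask)
print(missing_per_column)
print(total_missing)
```

Here each column has one missing cell, so the per-column counts are 1 and 1 and the total is 2.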
Deleting the values:
After finding the missing values, if we need to delete them we can use the “dropna” method like,
a.dropna()
This code drops every row that contains a null value. Like most operations in Pandas, it does not modify the original data. Instead it returns a new dataframe with those rows removed and leaves “a” unchanged.
If we need to keep the change, we can assign the result back like
a=a.dropna()
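A quick way to convince yourself that dropna returns a new dataframe is to compare the row counts before and after. This is only a sketch with made-up data; the names cleaned, rows_before and rows_after are mine.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the first row has no missing value
a = pd.DataFrame({"X": [1.0, np.nan, 3.0], "Y": [4.0, 5.0, np.nan]})

cleaned = a.dropna()        # a new dataframe; "a" itself is not touched
rows_before = len(a)        # "a" still has all of its rows
rows_after = len(cleaned)   # only the complete row survives

a = a.dropna()              # reassign to keep the cleaned result under the same name

print(rows_before, rows_after, len(a))
```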
So far we have deleted rows that contain a null value. If we instead need to delete a column when it has any null value, it can be done like
a.dropna(axis=1)
This deletes every column that has at least one null value.
a.dropna(how="all")
The above code deletes a row only if all the values in that row are null.
a.dropna(how="any")
The above code deletes a row if any one of its cells is null; this is the default behaviour. Instead of this we can also set a threshold for deleting, like
a.dropna(thresh=2)
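This keeps only the rows that have at least 2 non-null values, so a row with fewer than 2 non-null values is deleted. Note that thresh counts the values that are present, not the null ones.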
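The different dropna options are easier to compare on one small dataframe. The data below is made up for illustration, and the result names are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical data:
# row 0 is complete, row 1 has one value, row 2 is empty, row 3 has two values
a = pd.DataFrame({
    "P": [1.0, np.nan, np.nan, 4.0],
    "Q": [1.0, 2.0, np.nan, 4.0],
    "R": [1.0, np.nan, np.nan, np.nan],
})

any_rows = a.dropna(how="any")     # keeps only rows with no null at all
all_rows = a.dropna(how="all")     # drops only the completely empty row
thresh_rows = a.dropna(thresh=2)   # keeps rows with at least 2 non-null values

# Drop the empty row first, then drop every column that still has a null
complete_cols = a.dropna(how="all").dropna(axis=1)

print(list(any_rows.index))
print(list(all_rows.index))
print(list(thresh_rows.index))
print(list(complete_cols.columns))
```

With this data, how="any" keeps row 0, how="all" keeps rows 0, 1 and 3, thresh=2 keeps rows 0 and 3, and the last line keeps only column Q.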
Fixing the values:
— For series:
Instead of deleting the row or column with the null value, we can also fill it with a default value using the “fillna” method.
a.fillna(0)
The above code fills every null cell with 0. The fill value can also come from a calculation, like
a.fillna(a.mean())
There are other ways to fill as well. If we need to set a null cell to the value of the previous cell, it can be done like,
a.ffill()
Older code often writes this as a.fillna(method="ffill"), but the “method” argument of fillna is deprecated in recent pandas versions. On a dataframe, the same operation can be done across the columns of each row (left to right instead of top to bottom) by setting the axis value, like
a.ffill(axis=1) # along each row, fill with the value from the previous column
a.bfill(axis=1) # along each row, fill with the value from the next column
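Here is a small sketch of these filling options on made-up data. It uses the ffill and bfill methods directly, which do the same job as fillna with method="ffill" or method="bfill". The variable names are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values
s = pd.Series([10.0, np.nan, 30.0, np.nan])

filled_zero = s.fillna(0)          # every null becomes 0
filled_mean = s.fillna(s.mean())   # mean() skips nulls: (10 + 30) / 2 = 20
forward = s.ffill()                # each null takes the previous value
backward = s.bfill()               # each null takes the next value; the last one has none

# On a dataframe, axis=1 fills along each row, from one column to the next
df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 5.0], "C": [3.0, np.nan]})
across = df.ffill(axis=1)

print(forward.tolist())
print(across)
```

Notice that a forward fill cannot fill a null that has nothing before it, and a backward fill cannot fill a null that has nothing after it, so some cells can stay empty.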
— Dataframe:
The filling operations can also be used on dataframes. With a dataframe we can specify a different fill value for each column by passing a dictionary, like
b.fillna({'NAME': 'AK', 'COUNTRY': 'INDIA'})
The above code fills the null values in the column “NAME” with “AK” and in the column “COUNTRY” with “INDIA” in the dataframe called “b”.
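As a runnable sketch of the per-column fill, assuming a small made-up dataframe (only the column names NAME and COUNTRY come from the text above):

```python
import pandas as pd

# Hypothetical data with nulls in both columns
b = pd.DataFrame({
    "NAME": ["RAJ", None, "SAM"],
    "COUNTRY": [None, "USA", None],
})

# The dictionary maps each column name to its own fill value
filled = b.fillna({"NAME": "AK", "COUNTRY": "INDIA"})

print(filled)
```

As with dropna, the result is a new dataframe, so “b” still contains its nulls unless we assign the result back.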
If, instead of null values, a particular value was entered incorrectly everywhere, it can be changed like
d['Sex'] = d['Sex'].replace('D', 'F')
The above code changes every “D” in the ‘Sex’ column to “F”. The result is assigned back to the column because replace returns a new series.
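A minimal sketch of the replacement on made-up data. The column name ‘Sex’ follows the text above, and the values are assumptions for the example.

```python
import pandas as pd

# Hypothetical column where "D" was typed by mistake instead of "F"
d = pd.DataFrame({"Sex": ["M", "D", "F", "D"]})

# replace() is called with parentheses and returns a new series,
# so the result is assigned back to the column
d["Sex"] = d["Sex"].replace("D", "F")

print(d["Sex"].tolist())
```

If several values need fixing at once, replace also accepts a dictionary such as {"D": "F"} that maps each wrong value to its correction.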
Finding and deleting duplicate values:
To check whether a series contains duplicated values, we can use the “duplicated” method.
e.duplicated()
This gives True for every repeated value except its first occurrence. For dataframes we can specify the column name like,
df.duplicated(subset=['REG'])
If we need the search to run from bottom to top, so that the last occurrence is treated as the original, we can use the keep parameter,
e.duplicated(keep='last')
In some scenarios we need to see all the duplicated values, regardless of whether they come first or last. It can be done like,
e.duplicated(keep=False)
The above code returns True for every value that is duplicated. If we need to delete the duplicated values we can use the “drop_duplicates” method like,
e.drop_duplicates()
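To compare the keep options, here is a sketch on made-up data. The names “e”, “df” and ‘REG’ follow the text above; the values and the result names are assumptions. Note that the pandas method for deleting is spelled drop_duplicates.

```python
import pandas as pd

# Hypothetical series where "x" appears three times
e = pd.Series(["x", "y", "x", "z", "x"])

first_kept = e.duplicated()             # the first "x" counts as the original
last_kept = e.duplicated(keep="last")   # the last "x" counts as the original
all_marked = e.duplicated(keep=False)   # every "x" is marked
unique_values = e.drop_duplicates()     # keeps the first "x", then "y" and "z"

# On a dataframe, subset limits the comparison to the given columns
df = pd.DataFrame({"REG": [101, 102, 101], "NAME": ["AK", "RAJ", "SAM"]})
dup_reg = df.duplicated(subset=["REG"])
deduped = df.drop_duplicates(subset=["REG"])

print(first_kept.tolist())
print(unique_values.tolist())
print(deduped)
```

Like the other methods in this blog, drop_duplicates returns a new object, so assign the result back if you want to keep it.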