Welcome back, guys. It's the fourth day of the "My 7-day journey to Data Science" blog series. In the previous blog, we learned about the Series datatype in the pandas library. Today we are going to learn about another important datatype in pandas: the dataframe.
INDEX:
What is a dataframe
Accessing values
Conditional selection
As we all know, dataframes are the most useful datatype for storing and manipulating data for data analysis. Together with the Series, the dataframe is one of the two main datatypes in pandas.
A dataframe looks like a table with multiple columns, whereas a Series can hold only one column of values besides its index. To declare a dataframe,
import pandas
# REG is stored as strings: Python 3 rejects integer literals with a leading zero such as 01
a=pandas.DataFrame({'REG':['01','02','03'],
'MARK':[75,80,90] })
These values are stored like a table with multiple columns, where the leftmost column is the index, which by default is the row position (0, 1, 2, and so on). To view the dataframe we can just use its variable name, like
a
# output
  REG  MARK
0  01    75
1  02    80
2  03    90
So a dataframe is like a combination of Series, one per column, that all share the same index.
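To make the "combination of Series" idea concrete, here is a small sketch using the same data as above. The names mark, reg, marks and b are only for illustration, and REG is assumed to be stored as strings so the leading zeros are kept.

```python
import pandas

# Same data as in the post; REG is kept as strings so the leading zeros survive.
a = pandas.DataFrame({'REG': ['01', '02', '03'],
                      'MARK': [75, 80, 90]})

# Picking one column gives back a Series that shares the dataframe's index.
mark = a['MARK']
print(type(mark).__name__)          # Series
print(mark.index.equals(a.index))   # True

# Going the other way, a dataframe can be assembled from individual Series.
reg = pandas.Series(['01', '02', '03'])
marks = pandas.Series([75, 80, 90])
b = pandas.DataFrame({'REG': reg, 'MARK': marks})
print(b.equals(a))                  # True
```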
Just like with a Series, we can assign names to the index positions like,
a.index=['FIRST','SECOND','THIRD']
Now the dataframe will look like,
a
# output
       REG  MARK
FIRST   01    75
SECOND  02    80
THIRD   03    90
There are various built-in methods we can use with a dataframe. The most important and frequently used one is describe(), which gives a summary of each numeric column in the dataframe: count, mean, standard deviation, min, 25%, 50%, 75% and max. It can be used like,
a.describe()
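As a rough sketch of what describe() gives back for our small dataframe: the result is itself a dataframe, so single statistics can be read out of it with loc. The name summary is only for illustration, and REG is assumed to hold strings, so by default only MARK is summarised.

```python
import pandas

a = pandas.DataFrame({'REG': ['01', '02', '03'],
                      'MARK': [75, 80, 90]},
                     index=['FIRST', 'SECOND', 'THIRD'])

# describe() returns a new dataframe: one row per statistic,
# one column per numeric column of the original.
summary = a.describe()
print(summary)

# Individual statistics can be read back by row label and column name.
print(summary.loc['count', 'MARK'])  # number of values in MARK
print(summary.loc['mean', 'MARK'])   # average of 75, 80 and 90
print(summary.loc['50%', 'MARK'])    # median
```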
Other than describe(), there are various other methods and attributes. Some of them are
a.index # gives the index labels
a.columns # gives the column names
a.info() # shows the structure of the dataframe
a.size # gives the total number of elements
a.shape # gives the dimensions as (rows, columns)
a.dtypes # gives the datatype of each column
a.dtypes.value_counts() # gives the number of columns of each datatype
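Here is a short sketch of these attributes on our three-row dataframe. The exact datatype reported for the REG column can differ between pandas versions, so no output is shown for the dtype lines.

```python
import pandas

a = pandas.DataFrame({'REG': ['01', '02', '03'],
                      'MARK': [75, 80, 90]},
                     index=['FIRST', 'SECOND', 'THIRD'])

print(list(a.index))     # ['FIRST', 'SECOND', 'THIRD']
print(list(a.columns))   # ['REG', 'MARK']
print(a.shape)           # (3, 2) -> 3 rows, 2 columns
print(a.size)            # 6 -> rows * columns

# One datatype per column, then how many columns have each datatype.
print(a.dtypes)
print(a.dtypes.value_counts())
```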
Accessing values:
As we do with a Series, we can use the "loc" and "iloc" indexers like,
a.loc['FIRST'] # single row based on the index name
a.loc[['FIRST','THIRD']] # multiple rows based on a list of index names
a.loc['FIRST':'THIRD'] # range of rows based on the index name order, both ends included
The "iloc" indexer is useful in scenarios where we do not know the index name and only know the index position itself,
a.iloc[0] # single row
a.iloc[[0,2]] # multiple rows, given as a list of positions
a.iloc[0:3] # range of rows, the end position 3 is not included
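One detail that is easy to miss is how the two indexers treat the end of a range: a label range with loc includes the end label, while a position range with iloc stops before the end position. A minimal sketch on the same dataframe (the variable names are only for illustration):

```python
import pandas

a = pandas.DataFrame({'REG': ['01', '02', '03'],
                      'MARK': [75, 80, 90]},
                     index=['FIRST', 'SECOND', 'THIRD'])

# A single row comes back as a Series whose labels are the column names.
first_row = a.loc['FIRST']
print(first_row['MARK'])     # 75

# A list of labels or positions picks exactly those rows.
by_label = a.loc[['FIRST', 'THIRD']]
by_position = a.iloc[[0, 2]]
print(by_label.equals(by_position))   # True

# Label ranges include the end label; position ranges exclude the end position.
label_range = a.loc['FIRST':'SECOND']   # FIRST and SECOND
position_range = a.iloc[0:2]            # positions 0 and 1
print(len(label_range), len(position_range))   # 2 2
```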
The "loc" and "iloc" indexers are useful when we need to retrieve values based on the rows, but in case we need to retrieve values based on the columns we can use,
a['MARK']
The above code returns all the values present in the "MARK" column of the dataframe "a". This is helpful in situations where we need to filter results based on many conditions. For that we can combine these two ways of retrieving data into one query like,
a.loc['FIRST':'THIRD',['MARK']]
In the above code the range 'FIRST':'THIRD' before the comma selects the rows of the dataframe, and the list ['MARK'] after the comma selects the columns.
The output of the previous code will be something like,
        MARK
FIRST     75
SECOND    80
THIRD     90
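The shape of the result depends on how the column part is written. As a small sketch (variable names are only for illustration): a list of columns keeps the result as a dataframe, a single column name gives a Series, and a single row label with a single column name gives just the value.

```python
import pandas

a = pandas.DataFrame({'REG': ['01', '02', '03'],
                      'MARK': [75, 80, 90]},
                     index=['FIRST', 'SECOND', 'THIRD'])

# Rows by label range, columns as a list -> dataframe with one column.
as_frame = a.loc['FIRST':'THIRD', ['MARK']]
print(type(as_frame).__name__)    # DataFrame

# Same rows, column as a plain name -> Series.
as_series = a.loc['FIRST':'THIRD', 'MARK']
print(type(as_series).__name__)   # Series

# One row label and one column name -> a single value.
print(a.loc['SECOND', 'MARK'])    # 80

# The same selection by position: rows 0 to 2, column at position 1.
by_position = a.iloc[0:3, [1]]
print(by_position.equals(as_frame))   # True
```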
Conditional selections:
As we saw with Series, a conditional selection on pandas elements returns a boolean Series as its result. It holds True for all the values where the given condition is satisfied and False where it is not.
a['MARK']>75
The output of the above code will be like,
FIRST     False
SECOND     True
THIRD      True
Name: MARK, dtype: bool
Instead of getting True and False values like this, if we need the matching rows themselves we can use,
a.loc[a['MARK']>75]
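Here is a small sketch of how the boolean result can be reused and how several conditions can be combined. The names mask, high and middle are only for illustration. Conditions are joined with & (and) or | (or), and each one needs its own parentheses.

```python
import pandas

a = pandas.DataFrame({'REG': ['01', '02', '03'],
                      'MARK': [75, 80, 90]},
                     index=['FIRST', 'SECOND', 'THIRD'])

# The condition gives one True/False per row.
mask = a['MARK'] > 75
print(mask.tolist())          # [False, True, True]

# Passing the mask to loc keeps only the rows marked True.
high = a.loc[mask]
print(list(high.index))       # ['SECOND', 'THIRD']

# Several conditions: wrap each in parentheses and join with & or |.
middle = a.loc[(a['MARK'] > 75) & (a['MARK'] < 90)]
print(list(middle.index))     # ['SECOND']

# A row condition and a column name can be used together.
print(a.loc[mask, 'REG'].tolist())   # ['02', '03']
```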