DAY 4 - Pandas (DataFrames)

Welcome back guys. It's the fourth day of the "My 7-day journey to Data Science" blog series. In the previous blog, we learned about the Series datatype in the pandas library. Today we are going to learn about another important datatype in pandas, the DataFrame.

INDEX:

  • What is a dataframe

  • Accessing values

  • Conditional selection

The DataFrame is the datatype we use most often to store and manipulate data for data analysis. Together with the Series, it is one of the two core datatypes in pandas.

A dataframe looks like a table with multiple columns, whereas a Series can only have one column of values besides its index. To declare a dataframe,

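import pandas
# REG values are written as strings because Python 3 does not allow integers with leading zeros such as 01
a=pandas.DataFrame({'REG':['01','02','03'],
                    'MARK':[75,80,90] })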

These values will be stored like a table with multiple columns, where the first column is the index, which by default holds the positions 0, 1, 2 and so on. To view the dataframe we can just use its variable name like

  a

# output
   REG  MARK
0   01    75
1   02    80
2   03    90

So a dataframe is like a combination of Series that share the same index, with one Series per column.
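To make that idea concrete, here is a small sketch that builds the same table from two separate Series and then pulls one column back out. The variable names `reg` and `mark` are only for illustration, and REG is assumed to be stored as strings so the leading zero is kept.

```python
import pandas as pd

# Two Series that share the same default index (0, 1, 2).
reg = pd.Series(['01', '02', '03'])   # strings, so the leading zero is kept
mark = pd.Series([75, 80, 90])

# Combine them into a dataframe; the dictionary keys become the column names.
a = pd.DataFrame({'REG': reg, 'MARK': mark})

# Pulling a single column back out gives a Series again.
print(type(a['MARK']))
print(a['MARK'].tolist())
```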

Just as with a Series, we can assign names to the index positions like,

a.index=['FIRST','SECOND','THIRD']

Now the dataframe will look like,

a

# output
        REG  MARK
FIRST    01    75
SECOND   02    80
THIRD    03    90

There are various built-in methods we can use with a dataframe. One of the most frequently used is describe(), which gives a summary of each numeric column in the dataframe: count, mean, standard deviation, min, 25%, 50%, 75% and max. It can be used like,

a.describe()
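As a rough sketch of what describe() gives back, the snippet below runs it on the same sample data. It assumes REG is stored as strings, so only the numeric MARK column shows up in the summary. The name `summary` is just for illustration.

```python
import pandas as pd

a = pd.DataFrame({'REG': ['01', '02', '03'],
                  'MARK': [75, 80, 90]})

# describe() returns a new dataframe: one row per statistic,
# one column per numeric column of the original.
summary = a.describe()
print(summary)

# A single statistic can be looked up by row label and column name.
print(summary.loc['mean', 'MARK'])
```

Since the result is itself a dataframe, everything we learn about accessing values in this post also works on it.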

Other than describe(), there are various attributes and methods. Some of them are

a.index        # gives the index labels
a.columns      # gives the column names
a.info()       # shows the structure of the dataframe
a.size         # gives the total number of elements
a.shape        # gives the dimensions as (rows, columns)
a.dtypes       # gives the datatype of each column
a.dtypes.value_counts()    # gives the number of columns of each datatype
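Here is a quick sketch of a few of these on the sample dataframe, assuming the named index from earlier and REG stored as strings. The exact datatype names printed can differ between pandas versions, so they are not shown as expected output.

```python
import pandas as pd

a = pd.DataFrame({'REG': ['01', '02', '03'],
                  'MARK': [75, 80, 90]},
                 index=['FIRST', 'SECOND', 'THIRD'])

print(list(a.index))     # ['FIRST', 'SECOND', 'THIRD']
print(list(a.columns))   # ['REG', 'MARK']
print(a.shape)           # (3, 2) -> 3 rows, 2 columns
print(a.size)            # 6 -> rows multiplied by columns

# Counts how many columns have each datatype.
dtype_counts = a.dtypes.value_counts()
print(dtype_counts)
```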

Accessing values:

As with most of the other pandas datatypes, we can use the "loc" or "iloc" indexers like,

a.loc['FIRST']              # single row based on the index name
a.loc[['FIRST','THIRD']]    # multiple rows based on the index names
a.loc['FIRST':'THIRD']      # range of rows in index order, 'THIRD' included
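The three forms can be tried out with the short sketch below, using the sample dataframe with its named index. The variable names are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({'REG': ['01', '02', '03'],
                  'MARK': [75, 80, 90]},
                 index=['FIRST', 'SECOND', 'THIRD'])

one_row = a.loc['FIRST']                # a single row comes back as a Series
two_rows = a.loc[['FIRST', 'THIRD']]    # a list of labels gives a dataframe
row_range = a.loc['FIRST':'THIRD']      # label slices include the end label

print(one_row['MARK'])
print(list(two_rows.index))
print(len(row_range))
```

Note the double brackets in the second form: a list of labels selects several rows, while a.loc['FIRST', 'THIRD'] would be read as row 'FIRST' and column 'THIRD'.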

The "iloc" indexer is useful in scenarios where we do not know the name of an index label and only know its position, counting from 0,

a.iloc[0]        # single row
a.iloc[[0,2]]    # multiple rows
a.iloc[0:3]      # range of rows, end position excluded
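A minimal sketch of position-based access on the same sample data follows. Positions start at 0, and unlike label slices with "loc", a position slice leaves out its end position. The variable names are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({'REG': ['01', '02', '03'],
                  'MARK': [75, 80, 90]},
                 index=['FIRST', 'SECOND', 'THIRD'])

first = a.iloc[0]          # first row, returned as a Series
picked = a.iloc[[0, 2]]    # first and third rows
sliced = a.iloc[0:2]       # positions 0 and 1 only, position 2 is excluded
cell = a.iloc[0, 1]        # row position 0, column position 1

print(list(picked.index))
print(len(sliced))
print(cell)
```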

"loc" and "iloc" are useful when we need to retrieve values based on the rows. If we need to retrieve values based on the columns instead, we can use the column name like,

a['MARK']

The above code returns all the values present in the "MARK" column of the dataframe "a". Often we need to pick certain rows and certain columns at the same time. For that we can combine these two ways of retrieving data into one query like,

a.loc['FIRST':'THIRD',['MARK']]

In the above code, the first part, 'FIRST':'THIRD', selects the rows of the dataframe, and the second part, ['MARK'], selects the column.

The output of the previous code will be something like,

        MARK
FIRST     75
SECOND    80
THIRD     90
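One detail worth trying out is the difference between passing the column as a list and as a plain name. The sketch below, with illustrative variable names, shows that a list keeps the result as a dataframe while a single name gives a Series.

```python
import pandas as pd

a = pd.DataFrame({'REG': ['01', '02', '03'],
                  'MARK': [75, 80, 90]},
                 index=['FIRST', 'SECOND', 'THIRD'])

# Column given as a list: the result is still a dataframe with one column.
as_frame = a.loc['FIRST':'THIRD', ['MARK']]

# Column given as a plain name: the result is a Series.
as_series = a.loc['FIRST':'THIRD', 'MARK']

print(as_frame.shape)        # (3, 1)
print(as_series.tolist())    # [75, 80, 90]
```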

Conditional selection:

As we saw with Series, applying a condition to a dataframe column returns a boolean Series as its result. It holds True for all the values where the given condition is satisfied and False where it is not.

a['MARK']>75

The output of the above code will be like,

FIRST     False
SECOND     True
THIRD      True
Name: MARK, dtype: bool

If we need the matching rows themselves instead of True and False values, we can pass the condition to "loc" like,

a.loc[a['MARK']>75]
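Here is a small sketch of conditional selection on the sample data, including one way to combine two conditions. The `&` operator (and) and the parentheses around each condition are standard pandas usage, while the variable names are only for illustration.

```python
import pandas as pd

a = pd.DataFrame({'REG': ['01', '02', '03'],
                  'MARK': [75, 80, 90]},
                 index=['FIRST', 'SECOND', 'THIRD'])

# The condition alone gives a boolean Series, one value per row.
mask = a['MARK'] > 75

# Passing that Series to loc keeps only the rows marked True.
passed = a.loc[mask]

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses.
in_between = a.loc[(a['MARK'] > 75) & (a['MARK'] < 90)]

print(list(passed.index))        # ['SECOND', 'THIRD']
print(list(in_between.index))    # ['SECOND']
```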


THANKS FOR READING. Don't forget to follow our blog for more updates.