Data Analysis - Working with files(CSV,SQL,HTML,EXCEL)

Data Analysis - Working with files(CSV,SQL,HTML,EXCEL)

Hey folks, It's so happy to see you all in another blog. As you all know I’m Akash Pugal from India and this is Day 6 in the “My 7-day journey to Data Science” blog series.

Till the previous blog, we have seen about how to work with Pandas and NumPy based on their input methods. In this blog we are going to learn How to read and write data from various file formats like CSV, Html, SQL, Excel sheets and start working with them.

INDEX:

  • Read and write CSV files

  • SQL files

  • Html files

  • Excel sheets

Read and write CSV files:

As we know there are various libraries available in Python for file operations,In this blog we are going to read and write using two methods. They are,

→CSV Library

→CSV method in Pandas

CSV Library:

For using the CSV library we have to import it like usual,

import csv

For reading the files using CSV library we have to use a new method called “reader”.We can do it like

with open('pugal.csv','r') as f:
  data=csv.reader(f,delimiter=',')

This code opens the file named “pugal.csv” in read mode and stores the data into a variable called “data” using the csv reader.This “csv.reader” can have many parameters but not every parameter is necessary.

The first element is always the variable name/url link of the file we need to work with.The “delimeter” method is used to split the lines in the csv file based on the needs.In this given code,in each line of the given file the compiler will splits each value based on the comma(,) and make it as a new column.

If we need to skip a line from reading ,we can use next keyword like

next(data)

This code will skip one line from the file.We can use it at the top of the file to skip empty rows or headers if we didn’t need it.

CSV method in Pandas:

This is the second method to work with csv files.Without importing anyother new library we can just use a method in pandas library called “read_csv”.If we have the file in local storage we can use it like,

import pandas

a=pandas.read("C:/NewFolder/list.csv")

This also reads the file from its location and stores it in a variable called “a”.But in case if we have the url of the file instead of the file in local storage ,we can also use it like,

b="https;//drive/ak/pugal.csv"
a=pandas.read_csv(b)

By using the “header” parameter inside the method we can define the first line of the file should be treated as a header or not.

a=pandas.read_csv(filename,header=None)

This tell the compiler to not to treat the first line of the file as header.

If we need to define the column names by our own,we can give it using the “names” parameter,like

a=pandas.read_csv(filename,names=['Name', 'Phone'])

This code defines the first column of the file as Name and the next column as Phone.

Incase od we have to treat the cells with a specific values as a “NaN” column which is a kind of null value,we can specify it using the “na_value” parameter.

A=pandas.read_csv(filename,na_values=['-', ','])

If we need to split the values in the file based on our own symbols or values such as “delimiter” we can specify it using “sep” keyword in pandas like

A=pandas.read_csv(filename,sep= ',')

This splits the value to each column every time if the comma(,) symbol occurs.

Storing in CSV:

As reading the csv files,we can also write to a csv file using “to_csv” method to store it locally.

pandas.to_csv("output.csv")

This stores the result as the csv file named “output” in the local storage.

Sql files:

For handling database related operations we can use the “sqllite3” library with the “read_sql” method from Pandas.We can import sqllite3 library as

import sqllite3

First of all we have to create a connection with the database in order to perfom operations in it.Creating a connection can be done with the help of “sqllite3” library like,

con=sqllite3.connect('ak.db')

Here, the connection is created between the database named “ak” and been stored in a variable “con”.Now in order to do query operations we have to use the “read_sql” method from pandas library.

a=pandas.read_sql('Select * from draft',con,index_col='Reg')

This code reads all the data from the “draft” table using “read_sql” method from pandas and stores it into the variable “a”.Usually the first parameter for “read_sql” method should be the query we have to pass.The next parameter should be the connection variable we created previously using “sqllite3.connect” method.The “index_col” parameter is used to set the column which must be used as the index.

As same as reading sql files we can also write datas directly to the sql files using “to_sql” method from pandas library.

HTML:

So Html tables and files are mostly optimised for the users and not for the computers.So it may be very dificult to save and clean datas in an complex html page table.

For reading html tables we can use the “read_html” method from pandas library.So for reading a html table

a="https://_________"
table=pandas.read(a)

The above code takes all the tables int he url specified in the variable “a” and stores it all in another variable called “table”.To view the length of the total tabels fetched ,we can use the “len” keyword,

len(table)

If there are various tables in the page and we need only a specified table to be stored in a new variable,then we can do it like

newtable=table[2]

If we need all the tables in the html page to be displayed ,then we can do it by just calling the variable name

table

If we need to delete a specific table variable we can do it using the “drop”command.

new.drop()

Excel files:

Till now we learned about how to work with csv files,sql files and html pages.Now lets see how to work with excel sheets.For reading from a excel file we can use the “read_excel” method in pandas.

b=pandas.read_excel(filename)

This reads the excel file and stores it in the variable “b”.If this excel file has more dhan one sheets then it just read the first sheet and stores it.,But incase if we did not need the first sheet and we need any other specific sheet ,then we can specify it like,

b=pandas.read_excel=(filename,sheet_name='Product')

Like this we can specify which sheet needs to be readed.This method also has various parameter to work with excel files but not everything is mandatory.

As same as this we can store the results to an excel sheet by using the “to_excel” method in pandas.

b.to_excel('NEW.xlsx')

This stores the result in the variable “b” to an excel sheet named “NEW”.But if we need to store more dhan one sheet in a file and if we need to specify the sheet name that is also possible by using the “sheet_name” parameter

b.to_excel('NEX.xlsx',sheet_name='Products')

This method also has various parameter like header,index_col,etc.

Other socials:

LinkedIn

GITHUB

Twitter

MEDIUM

THANKS FOR READING. Don't forget to follow our blog for more updates.