CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
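As a small illustration of the record and field structure, the sketch below parses an in-memory sample with Python's standard csv module. The sample data is made up for this example. It also shows one caveat: a quoted field may contain a comma or even a line break, so the number of records can be smaller than the number of physical lines.

```python
import csv
import io

# Hypothetical sample: a header, one record with a quoted comma,
# and one record whose quoted field spans two physical lines.
sample = 'name,comment\nAda,"likes, commas"\nGrace,"two\nlines"\n'

# csv.reader yields one list of fields per record.
records = list(csv.reader(io.StringIO(sample)))

# Counting newline characters gives the number of physical lines.
physical_lines = sample.count("\n")

for record in records:
    print(record)
print("records:", len(records), "physical lines:", physical_lines)
```

For files without quoted line breaks the two numbers agree, which is the case the rest of this article assumes.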
In this article, we are going to discuss various approaches to count the number of lines in a CSV file using Python.
We will use the dataset stored in Data.csv for all operations. The snippet below loads and prints it:
Python3
import pandas as pd

# Load the CSV file into a DataFrame and display it
results = pd.read_csv('Data.csv')
print(results)
To count the number of lines/rows present in a CSV file, we have two different types of methods:
- Using the len() function.
- Using a counter.
Using len() function
Under this method, we read the CSV file with the pandas library and then pass the resulting DataFrame to the len() function, which returns the number of rows as an int. Because pandas treats the first line as the column header by default, the header is not included in this count.
Python3
import pandas as pd

# Read the file; the first line is used as the header by default
results = pd.read_csv('Data.csv')
print("Number of lines present:-", len(results))
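One detail is easy to miss with this method: by default pandas uses the first line as the column header, so len() reports the data rows only. A minimal sketch with made-up in-memory data (the names csv_text, data_rows and physical_lines are chosen for this example):

```python
import io

import pandas as pd

# Hypothetical sample data: one header line followed by three records.
csv_text = "name,score\nAda,90\nGrace,85\nLinus,70\n"

# read_csv accepts a file-like object as well as a path.
df = pd.read_csv(io.StringIO(csv_text))

data_rows = len(df)                           # records only; the header became df.columns
physical_lines = len(csv_text.splitlines())   # every line, header included

print("Data rows:", data_rows)
print("Physical lines:", physical_lines)
```

If the total number of lines in the file is what you need, add 1 to len(df) for the header, assuming the file has exactly one header line and no blank lines.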
Using a counter
Under this approach, we initialize an integer rowcount to 0, iterate through the whole file line by line, increment rowcount by one for each line, and print the value at the end. Starting from 0 counts every line, including the heading; to count only the data rows, initialize rowcount to -1 instead so that the heading is not counted.
Python3
rowcount = 0
# Iterate through the whole file, one line at a time
with open("Data.csv") as f:
    for row in f:
        rowcount += 1
# Print the result
print("Number of lines present:-", rowcount)
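The starting value of the counter decides whether the heading is counted. The sketch below wraps the loop in a hypothetical helper, count_lines, and runs it on a small temporary file created just for this example, so both variants can be compared side by side.

```python
import os
import tempfile


def count_lines(path, skip_header=False):
    """Count physical lines in a text file, optionally excluding the first line."""
    # Start at -1 so that the header line brings the counter to 0.
    rowcount = -1 if skip_header else 0
    with open(path) as f:
        for _ in f:
            rowcount += 1
    # An empty file would leave rowcount at -1 when skipping the header.
    return max(rowcount, 0)


# Hypothetical sample file: one header line plus two records.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("name,score\nAda,90\nGrace,85\n")
    sample_path = tmp.name

total_lines = count_lines(sample_path)
data_rows = count_lines(sample_path, skip_header=True)
os.remove(sample_path)

print("Lines including header:", total_lines)
print("Rows excluding header:", data_rows)
```

Because the file is read one line at a time, memory use stays small even for large files.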
2018-10-29 EDIT
Thank you for the comments.
I timed several ways of getting the number of lines in a CSV file. The fastest method is below.
with open(filename) as f:
    sum(1 for line in f)
Here is the code tested.
import timeit
import csv
import pandas as pd

filename = './sample_submission.csv'

def talktime(filename, funcname, func):
    print(f"# {funcname}")
    t = timeit.timeit(f'{funcname}("{filename}")', setup=f'from __main__ import {funcname}', number=100) / 100
    print('Elapsed time : ', t)
    print('n = ', func(filename))
    print('\n')

def sum1forline(filename):
    with open(filename) as f:
        return sum(1 for line in f)
talktime(filename, 'sum1forline', sum1forline)

def lenopenreadlines(filename):
    with open(filename) as f:
        return len(f.readlines())
talktime(filename, 'lenopenreadlines', lenopenreadlines)

def lenpd(filename):
    return len(pd.read_csv(filename)) + 1
talktime(filename, 'lenpd', lenpd)

def csvreaderfor(filename):
    cnt = 0
    with open(filename) as f:
        cr = csv.reader(f)
        for row in cr:
            cnt += 1
    return cnt
talktime(filename, 'csvreaderfor', csvreaderfor)

def openenum(filename):
    cnt = 0
    with open(filename) as f:
        for i, line in enumerate(f, 1):
            cnt += 1
    return cnt
talktime(filename, 'openenum', openenum)
The results were as follows.
# sum1forline
Elapsed time : 0.6327946722068599
n = 2528244
# lenopenreadlines
Elapsed time : 0.655304473598555
n = 2528244
# lenpd
Elapsed time : 0.7561274056295324
n = 2528244
# csvreaderfor
Elapsed time : 1.5571560935772661
n = 2528244
# openenum
Elapsed time : 0.773000013928679
n = 2528244
In conclusion, sum(1 for line in f) is the fastest, but the difference from len(f.readlines()) may not be significant.
sample_submission.csv is 30.2MB and has 31 million characters.
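For anyone who wants to repeat a comparison like this on their own data, here is a minimal sketch of a variation on the harness above. It passes a callable to timeit.repeat instead of building a statement string, and keeps the best of several runs. The function names, the generated test file, and the repeat counts are assumptions made for this sketch, not part of the benchmark reported above, and the timings it prints will differ from machine to machine.

```python
import os
import tempfile
import timeit


def count_with_sum(path):
    with open(path) as f:
        return sum(1 for _ in f)


def count_with_readlines(path):
    with open(path) as f:
        return len(f.readlines())


# Hypothetical test file; its size is an arbitrary choice for this sketch.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,value\n")
    tmp.writelines(f"{i},{i * 2}\n" for i in range(10_000))
    sample_path = tmp.name

timings = {}
counts = {}
for func in (count_with_sum, count_with_readlines):
    counts[func.__name__] = func(sample_path)
    # timeit.repeat accepts a callable; keep the best of 5 runs of 20 calls each.
    best = min(timeit.repeat(lambda: func(sample_path), number=20, repeat=5))
    timings[func.__name__] = best / 20
    print(f"{func.__name__}: n = {counts[func.__name__]}, "
          f"{timings[func.__name__]:.6f} s per call")

os.remove(sample_path)
```

Taking the minimum of the repeats reduces the influence of other activity on the machine, which matters when the candidates are as close as the two fastest ones here.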