I get the impression that this is a common problem, I have a csv file with newlines within the fields. I am looking for a fix within Python--and within the csv module if possible.
Here is an example file that I have created
$ more test_csv.csv
a,"b",c,d,"e
e
e",f
a,bb,c,d,ee ,"f
f"
a,b,"c
c",d,e,f
Not all fields will be wrapped in quotes [although my usage is random in this example, the actual file should match quoting=csv.QUOTE_MINIMAL]
The output should resemble
[[a,b,c,d,"e\ne\ne",f],[a,bb,c,d,ee,"f\nf"][a,b,"c\nc",d,e,f]]
Instead I am getting
[[['a', 'b', 'c', 'd', 'e\n']], [['e']], [['e"', 'f']], [['a', 'bb', 'c', 'd', 'ee ', 'f\n']], [['f"']], [['a', 'b', 'c\n']], [['c"', 'd', 'e', 'f']]]
Please focus on the amount of rows and columns. Another concern is that in the thirds row, a quote was included when it should not have been.
Here is my code so far:
import csv
file = open['test_csv.csv', 'r']
rows = []
for line in file:
fields = []
mycsv = csv.reader[[line], dialect='excel', \
quotechar='"', quoting=csv.QUOTE_MINIMAL]
for field in mycsv:
fields.append[field]
rows.append[fields]
Thank you.
I assume that you want to keep the newlines in the strings for some reason after you have loaded the csv files from disk. Also that this is done again in Python. My solution will require Python 3, although the principle could be applied to Python 2.
The main trick
This is to replace the \n
characters before writing with a weird character that otherwise wouldn't be included, then to swap that weird character back for \n
after reading the file back
from disk.
For my weird character, I will use the Icelandic thorn: Þ, but you can choose anything that should otherwise not appear in your text variables. Its name, as defined in the standardised Unicode specification is: LATIN SMALL LETTER THORN. You can use it in Python 3 a couple of ways:
weird_literal = 'þ'
weird_name = '\N{LATIN SMALL LETTER THORN}'
weird_char = '\xfe' # hex representation
weird_literal == weird_name == weird_char # True
That \N
is pretty cool [and works in python 3.6 inside formatted strings too]... it basically allows you to pass the Name of a character,
as per Unicode's specification.
An alternative character that may serve as a good standard is '\u2063'
[INVISIBLE SEPARATOR].
Replacing \n
Now we use this weird character to replace '\n'
. Here are the two ways that pop into my mind for achieving this:
using a list comprehension on your list of lists:
data
:new_data = [[sample[0].replace['\n', weird_char] + weird_char, sample[1]] for sample in data]
putting the data into a dataframe, and using replace on the whole
text
column in one godf1 = pd.DataFrame[data, columns=['text', 'category']] df1.text = df.text.str.replace['\n', weird_char]
The resulting dataframe looks like this, with newlines replaced:
text category
0 some text in one line 1
1 text withþnew line character 0
2 another newþline character 1
Writing the results to disk
Now we write either of those identical dataframes to disk. I set index=False
as you said you don't want row numbers to be in the CSV:
FILE = '~/path/to/test_file.csv'
df.to_csv[FILE, index=False]
What does it look like on disk?
text,category
some text in one line,1
text withþnew line character,0
another newþline character,1
Getting the original data back from disk
Read the data back from file:
new_df = pd.read_csv[FILE]
And we can replace the Þ
characters back to \n
:
new_df.text = new_df.text.str.replace[weird_char, '\n']
And the final DataFrame:
new_df
text category
0 some text in one line 1
1 text with\nnew line character 0
2 another new\nline character 1
If you want things back into your list of lists, then you can do this:
original_lists = [[text, category] for index, text, category in old_df_again.itertuples[]]
Which looks like this:
[['some text in one line', 1],
['text with\nnew line character', 0],
['another new\nline character', 1]]