Python csv ignore newline in quotes

I get the impression that this is a common problem, I have a csv file with newlines within the fields. I am looking for a fix within Python--and within the csv module if possible.

Here is an example file that I have created

$ more test_csv.csv
a,"b",c,d,"e
e
e",f
a,bb,c,d,ee ,"f
f"
a,b,"c
c",d,e,f

Not all fields will be wrapped in quotes (although my usage is random in this example, the actual file should match quoting=csv.QUOTE_MINIMAL)

The output should resemble

[[a,b,c,d,"e\ne\ne",f],[a,bb,c,d,ee,"f\nf"][a,b,"c\nc",d,e,f]]

Instead I am getting

[[['a', 'b', 'c', 'd', 'e\n']], [['e']], [['e"', 'f']], [['a', 'bb', 'c', 'd', 'ee ', 'f\n']], [['f"']], [['a', 'b', 'c\n']], [['c"', 'd', 'e', 'f']]]

Please focus on the amount of rows and columns. Another concern is that in the thirds row, a quote was included when it should not have been.

Here is my code so far:

import csv

file = open('test_csv.csv', 'r')
rows = []
for line in file:
  fields = []  
  mycsv = csv.reader([line], dialect='excel', \
    quotechar='"', quoting=csv.QUOTE_MINIMAL)
  for field in mycsv:
    fields.append(field)
  rows.append(fields)

Thank you.

I assume that you want to keep the newlines in the strings for some reason after you have loaded the csv files from disk. Also that this is done again in Python. My solution will require Python 3, although the principle could be applied to Python 2.

The main trick

This is to replace the \n characters before writing with a weird character that otherwise wouldn't be included, then to swap that weird character back for \n after reading the file back from disk.

For my weird character, I will use the Icelandic thorn: Þ, but you can choose anything that should otherwise not appear in your text variables. Its name, as defined in the standardised Unicode specification is: LATIN SMALL LETTER THORN. You can use it in Python 3 a couple of ways:

    weird_literal = 'þ'
    weird_name = '\N{LATIN SMALL LETTER THORN}'
    weird_char = '\xfe'  # hex representation
    weird_literal == weird_name == weird_char  # True

That \N is pretty cool (and works in python 3.6 inside formatted strings too)... it basically allows you to pass the Name of a character, as per Unicode's specification.

An alternative character that may serve as a good standard is '\u2063' (INVISIBLE SEPARATOR).

Replacing \n

Now we use this weird character to replace '\n'. Here are the two ways that pop into my mind for achieving this:

  1. using a list comprehension on your list of lists: data:

     new_data = [[sample[0].replace('\n', weird_char) + weird_char, sample[1]]
                  for sample in data]
    
  2. putting the data into a dataframe, and using replace on the whole text column in one go

     df1 = pd.DataFrame(data, columns=['text', 'category'])
     df1.text = df.text.str.replace('\n', weird_char)
    

The resulting dataframe looks like this, with newlines replaced:

               text              category
0         some text in one line      1   
1  text withþnew line character      0   
2    another newþline character      1   

Writing the results to disk

Now we write either of those identical dataframes to disk. I set index=False as you said you don't want row numbers to be in the CSV:

FILE = '~/path/to/test_file.csv'
df.to_csv(FILE, index=False)

What does it look like on disk?

text,category

some text in one line,1

text withþnew line character,0

another newþline character,1

Getting the original data back from disk

Read the data back from file:

new_df = pd.read_csv(FILE)

And we can replace the Þ characters back to \n:

new_df.text = new_df.text.str.replace(weird_char, '\n')

And the final DataFrame:

new_df
               text               category
0          some text in one line      1   
1  text with\nnew line character      0   
2    another new\nline character      1   

If you want things back into your list of lists, then you can do this:

original_lists = [[text, category] for index, text, category in old_df_again.itertuples()]

Which looks like this:

[['some text in one line', 1],
 ['text with\nnew line character', 0],
 ['another new\nline character', 1]]