I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:
import pandas as pd
import boto
data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')
In order to give complete access I have set the bucket policy on the S3 bucket as follows:
{
"Version": "2012-10-17",
"Id": "statement1",
"Statement": [
{
"Sid": "statement1",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:*",
"Resource": "arn:aws:s3:::example_bucket"
}
]
}
Unfortunately I still get the following error in python:
boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed
Wondering if someone could help explain how to either correctly set the permissions in AWS S3 or configure pandas correctly to import the file. Thanks!
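One issue independent of the bucket policy: pd.read_csv is a function, so it takes parentheses, and pandas (with the s3fs dependency installed) expects an s3:// URI to the bucket and key, not the static-website hosting endpoint. A minimal sketch of the corrected call, assuming a bucket named example_bucket, with the call syntax also demonstrated against an in-memory buffer:

```python
import io

import pandas as pd

# With s3fs installed, pandas can read straight from S3 (hypothetical bucket/key):
# df = pd.read_csv("s3://example_bucket/data_1.csv")

# The same call syntax, exercised locally against an in-memory text buffer:
csv_text = "id,name\n1,Jack\n2,Stark\n"
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))
```

The 405 from the website endpoint is expected: the S3 website hosting endpoint does not accept the API calls the SDK makes, so the REST endpoint (via the s3:// URI) is the one to use.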
Sometimes we need to read a CSV file directly from an Amazon S3 bucket. There are several ways to achieve this; the most common uses the csv module. First, import csv at the top of the Python file:
import csv
Then the function and code look like this:
import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='XYZACCESSKEY',
    aws_secret_access_key='XYZSECRETKEY',
    region_name='us-east-1'
)                                                                    #1
obj = s3.get_object(Bucket='bucket-name', Key='myreadcsvfile.csv')   #2
data = obj['Body'].read().decode('utf-8').splitlines()               #3
records = csv.reader(data)                                           #4
headers = next(records)                                              #5
print('headers: %s' % (headers))
for eachRecord in records:                                           #6
    print(eachRecord)
#1 — creates a client object for S3 with the access key, secret key, and region (assuming the reader already knows what an access key and secret key are).
#2 — gets the object from our bucket, given the bucket name and the file name of the CSV file.
In some cases the CSV file is not at the top level of the S3 bucket; it may be nested inside folders. In that scenario, line #2 changes as follows:
obj = s3.get_object(Bucket='bucket-name', Key='folder/subfolder/myreadcsvfile.csv')
#3 — line #2 gave us a handle on the CSV file object; now we need to read it. The data is in binary format, so we use the decode() function to convert it into readable text, then splitlines() to split each row into one record.
#4 — csv.reader(data) reads the data produced in line #3. With this we almost have the data; we just need to separate the headers from the actual rows.
#5 — next(records) consumes the first row, giving us all the headers of the CSV file.
#6 — the for loop iterates through the remaining records, printing each row of the CSV file.
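Steps #3 to #6 can be exercised locally without an AWS account by standing in for the S3 object's body with an in-memory bytes buffer (the sample rows here are made up):

```python
import csv
import io

# Stand-in for obj['Body'] returned by s3.get_object(); sample data is made up
body = io.BytesIO(b"id,name,age\n1,Jack,24\n2,Stark,29\n")

data = body.read().decode('utf-8').splitlines()   #3 bytes -> text -> one row per item
records = csv.reader(data)                        #4
headers = next(records)                           #5 first row is the header
print('headers: %s' % (headers))
for eachRecord in records:                        #6
    print(eachRecord)
```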
After getting the data, we don't want the headers and values in separate places; we want them combined so we know which value belongs to which header. Let's do it now: declare one list variable before the for loop
csvData = []
headerCount = len(headers)
and change the for loop like this
for eachRecord in records:
    tmp = {}
    for count in range(0, headerCount):
        tmp[headers[count]] = eachRecord[count]
    csvData.append(tmp)
print(csvData)
Now csvData contains the data in the form below:
[{'id': '1', 'name': 'Jack', 'age': '24'}, {'id': '2', 'name': 'Stark', 'age': '29'}]
Note: I formatted the data this way because it is my requirement; the formatting can be changed based on one's own needs.
Hope this helped! Happy coding and reading.
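Worth noting: the standard library can do this header-to-value mapping for us. csv.DictReader yields one dictionary per row, keyed by the header, so the manual inner loop can be replaced by a sketch like this (sample rows are made up):

```python
import csv

lines = ["id,name,age", "1,Jack,24", "2,Stark,29"]
# DictReader consumes the first line as the header and maps each row to it
csvData = [dict(row) for row in csv.DictReader(lines)]
print(csvData)
```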
Shi Han
Posted on Aug 22, 2020 • Updated on Sep 8, 2020
Here is a scenario. There is a huge CSV file on Amazon S3. We need to write a Python function that downloads, reads, and prints the values in a specific column on the standard output (stdout).
Simple Googling will lead us to the answer to this assignment on Stack Overflow. The code should look something like the following:
import codecs
import csv

import boto3

client = boto3.client("s3")

def read_csv_from_s3(bucket_name, key, column):
    data = client.get_object(Bucket=bucket_name, Key=key)
    for row in csv.DictReader(codecs.getreader("utf-8")(data["Body"])):
        print(row[column])
We will explore the solution above in detail in this article. Think of it as rubber duck debugging, with you as the rubber duck.
Downloading File from S3
Let's get started. First, we need to figure out how to download a file from S3 in Python. The official AWS SDK for Python is known as Boto3. According to the documentation, we can create the client instance for S3 by calling boto3.client("s3"). Then we call the get_object() method on the client with bucket name and key as input arguments to download a specific file.

Now the thing that we are interested in is the return value of the get_object() method call. The return value is a Python dictionary. In the Body key of the dictionary, we can find the content of the file downloaded from S3. The body data["Body"] is a botocore.response.StreamingBody. Hold that thought.
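The shape of that return value can be mimicked with plain Python; here io.BytesIO stands in for the StreamingBody, since both are binary file-like objects whose read() yields bytes:

```python
import io

# Imitation of the get_object() return value; real code gets this dict from boto3
data = {"Body": io.BytesIO(b"id,name\n1,Jack\n")}

chunk = data["Body"].read()
print(type(chunk))  # bytes, not str -- the detail to hold on to
```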
Reading CSV File
Let's switch our focus to handling CSV files. We want to access the value of a specific column row by row. csv.DictReader from the standard library seems to be an excellent candidate for this job. It returns an iterator (the class implements the iterator methods __iter__() and __next__()) that we can use to access each row in a for-loop: row[column]. But what should we pass into csv.DictReader as an argument? According to the documentation, we should refer to the underlying reader instance:

All other optional or keyword arguments are passed to the underlying reader instance.

There we can see that the first argument csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called. botocore.response.StreamingBody supports the iterator protocol 🎉.

Unfortunately, its __next__() method does not return a string but bytes instead:

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
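This failure is easy to reproduce without S3: feed csv.reader an iterator that yields bytes, and the very same error appears.

```python
import csv

byte_rows = [b"id,name", b"1,Jack"]  # like iterating a StreamingBody: bytes per line
reader = csv.reader(byte_rows)
error_message = ""
try:
    next(reader)
except csv.Error as exc:
    error_message = str(exc)
print(error_message)
```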
Reading CSV file from S3
So how do we bridge the gap between the botocore.response.StreamingBody type and the type required by the csv module? We want to "convert" the bytes to strings in this case. Therefore, the codecs module of Python's standard library seems to be the place to start.

Most standard codecs are text encodings, which encode text to bytes

Since we are doing the opposite, we are looking for a "decoder," specifically a decoder that can handle stream data: codecs.StreamReader.

Decodes data from the stream and returns the resulting object.

The codecs.StreamReader takes a file-like object as an input argument. In Python, this means the object should have a read() method. The botocore.response.StreamingBody does have the read() method:
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.read
Since the codecs.StreamReader also supports the iterator protocol, we can pass an instance of it into csv.DictReader:
https://github.com/python/cpython/blob/1370d9dd9fbd71e9d3c250c8e6644e0ee6534fca/Lib/codecs.py#L642-L651
The final piece of the puzzle is: how do we create the codecs.StreamReader? That's where the codecs.getreader() function comes into play. We pass the codec of our choice (in this case, utf-8) into codecs.getreader(), which creates the codecs.StreamReader. This allows us to read the CSV file row-by-row into dictionaries by passing the codecs.StreamReader into csv.DictReader:
Thank you for following this long and detailed (maybe too exhausting) explanation of such a short program. I hope you find it useful. Thank you for listening ❤️.