Read CSV from S3 bucket in Python

I am trying to read a CSV file located in an AWS S3 bucket into memory as a pandas dataframe using the following code:

import pandas as pd
import boto

data = pd.read_csv('s3:/example_bucket.s3-website-ap-southeast-2.amazonaws.com/data_1.csv')

In order to give complete access I have set the bucket policy on the S3 bucket as follows:

{
    "Version": "2012-10-17",
    "Id": "statement1",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example_bucket"
        }
    ]
}

Unfortunately I still get the following error in Python:

boto.exception.S3ResponseError: S3ResponseError: 405 Method Not Allowed

Wondering if someone could help explain how to either correctly set the permissions in AWS S3 or configure pandas correctly to import the file. Thanks!
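
For what it's worth, recent pandas versions can read straight from S3 when the optional s3fs dependency is installed, and a common cause of a 405 here is pointing at the bucket's static-website endpoint instead of the bucket itself. A minimal sketch, reusing the bucket and key from the question and assuming credentials are available via the standard AWS credential chain:

import pandas as pd

# Requires the optional s3fs package (pip install s3fs).
# Note the s3:// URL targets the bucket directly, not the
# s3-website-* static-hosting endpoint used in the question.
data = pd.read_csv('s3://example_bucket/data_1.csv')
print(data.head())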

Sometimes we may need to read a CSV file from an Amazon S3 bucket directly. We can achieve this in several ways; the most common is to use the csv module.

Import csv (and boto3, which the code below also needs) at the top of the Python file:

import csv

import boto3

then the function and code look like this:

s3 = boto3.client(
    's3',
    aws_access_key_id='XYZACCESSKEY',
    aws_secret_access_key='XYZSECRETKEY',
    region_name='us-east-1'
)  #1

obj = s3.get_object(Bucket='bucket-name', Key='myreadcsvfile.csv')  #2
data = obj['Body'].read().decode('utf-8').splitlines()  #3
records = csv.reader(data)  #4
headers = next(records)  #5
print('headers: %s' % headers)
for eachRecord in records:  #6
    print(eachRecord)

#1 — Creating an S3 client object with the S3 access key, secret key, and region (assuming the reader already knows what an access key and secret key are).
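
As an aside, hardcoding keys in source is generally discouraged; a minimal sketch of the same client creation that instead relies on boto3's default credential chain (environment variables, ~/.aws/credentials, or an IAM role):

import boto3

# No explicit keys: boto3 resolves credentials from the default chain
# (environment variables, ~/.aws/credentials, or an attached IAM role).
s3 = boto3.client('s3', region_name='us-east-1')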

#2 — Getting the object from our bucket, using the bucket name along with the file name of the CSV file.

In some cases the CSV file is not at the top level of the S3 bucket; it may be nested inside folders. In that scenario, line #2 should change as below:

obj = s3.get_object(Bucket='bucket-name', Key='folder/subfolder/myreadcsvfile.csv')

#3 — With the second line we got hold of the CSV file object; now we need to read it. The data will be in binary format, so we use the decode() function to convert it into readable text. Then we use the splitlines() function to split the text so that each row becomes one record.

#4 — Now we use csv.reader(data) to read the data produced on line #3.

With this we almost have the data; we just need to separate the headers from the actual rows.

#5 — With this we get all the headers of the CSV file.

#6 — Using a for loop, we iterate through the records and print each row of the CSV file.

After getting the data, we don't want the data and headers to live in separate places; we want them combined so we know which value belongs to which header. Let's do it now. Take one array variable before the for loop:

csvData = []
headerCount = len(headers)

and change the for loop like this:

for eachRecord in records:
    tmp = {}
    for count in range(0, headerCount):
        tmp[headers[count]] = eachRecord[count]
    csvData.append(tmp)
print(csvData)

Now csvData contains the data in the format below:

[{'id': '1', 'name': 'Jack', 'age': '24'}, {'id': '2', 'name': 'Stark', 'age': '29'}]

Note: I formatted the data this way because it was my requirement; the formatting can be changed to suit one's own needs.
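
As an alternative worth knowing, the standard library can produce the same list of dictionaries directly: csv.DictReader pairs each row with the header row for you, making the manual loop above optional. A minimal sketch, reusing the obj variable from line #2:

import csv

# DictReader combines headers and values automatically.
# 'obj' is the get_object() response from line #2 above.
data = obj['Body'].read().decode('utf-8').splitlines()
csvData = [dict(row) for row in csv.DictReader(data)]
print(csvData)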

Hope this helped! Happy coding and reading.

Shi Han

Posted on Aug 22, 2020 • Updated on Sep 8, 2020

Here is a scenario. There is a huge CSV file on Amazon S3. We need to write a Python function that downloads, reads, and prints the value in a specific column on the standard output (stdout).

Simple Googling will lead us to the answer to this assignment on Stack Overflow. The code should look something like the following:

import codecs
import csv

import boto3


client = boto3.client("s3")

def read_csv_from_s3(bucket_name, key, column):
    data = client.get_object(Bucket=bucket_name, Key=key)

    for row in csv.DictReader(codecs.getreader("utf-8")(data["Body"])):
        print(row[column])
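
A hypothetical invocation might look like this (bucket, key, and column names are placeholders, not from the original post):

# Placeholders for illustration only.
read_csv_from_s3("my-bucket", "reports/sales.csv", "price")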

We will explore the solution above in detail in this article. Imagine this as rubber duck programming, where you are the rubber duck.

Downloading File from S3

Let's get started. First, we need to figure out how to download a file from S3 in Python. The official AWS SDK for Python is known as Boto3. According to the documentation, we can create the client instance for S3 by calling boto3.client("s3"). Then we call the get_object() method on the client with the bucket name and key as input arguments to download a specific file.

Now the thing that we are interested in is the return value of the get_object() method call. The return value is a Python dictionary. In the Body key of the dictionary, we can find the content of the file downloaded from S3. The body data["Body"] is a botocore.response.StreamingBody. Hold that thought.
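
To make that concrete, here is a small illustrative sketch (bucket and key are placeholders) that inspects the response:

import boto3

client = boto3.client("s3")

# "my-bucket" and "data.csv" are placeholder names.
response = client.get_object(Bucket="my-bucket", Key="data.csv")

print(type(response))          # <class 'dict'>
print(type(response["Body"]))  # <class 'botocore.response.StreamingBody'>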

Reading CSV File

Let's switch our focus to handling CSV files. We want to access the value of a specific column one by one. csv.DictReader from the standard library seems to be an excellent candidate for this job. It returns an iterator (the class implements the iterator methods __iter__() and __next__()) that we can use to access each row in a for-loop: row[column]. But what should we pass into csv.DictReader as an argument? According to the documentation, we should refer to the reader instance.

All other optional or keyword arguments are passed to the underlying reader instance.

There we can see that the first argument csvfile

can be any object which supports the iterator protocol and returns a string each time its __next__() method is called

botocore.response.StreamingBody supports the iterator protocol 🎉.

Unfortunately, its __next__() method does not return a string but bytes instead.

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
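
A minimal sketch that reproduces the error (bucket and key are again placeholders):

import csv

import boto3

client = boto3.client("s3")
data = client.get_object(Bucket="my-bucket", Key="data.csv")  # placeholders

# StreamingBody iterates over raw bytes, so csv complains:
# _csv.Error: iterator should return strings, not bytes ...
for row in csv.DictReader(data["Body"]):
    print(row)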

Reading CSV file from S3

So how do we bridge the gap between the botocore.response.StreamingBody type and the type required by the csv module? We want to "convert" the bytes to strings in this case. Therefore, the codecs module of Python's standard library seems to be the place to start.

Most standard codecs are text encodings, which encode text to bytes

Since we are doing the opposite, we are looking for a "decoder," specifically a decoder that can handle stream data: codecs.StreamReader

Decodes data from the stream and returns the resulting object.

The codecs.StreamReader takes a file-like object as an input argument. In Python, this means the object should have a read() method. The botocore.response.StreamingBody does have the read() method: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.read

Since the codecs.StreamReader also supports the iterator protocol, we can pass the object of this instance into the csv.DictReader: https://github.com/python/cpython/blob/1370d9dd9fbd71e9d3c250c8e6644e0ee6534fca/Lib/codecs.py#L642-L651

The final piece of the puzzle is: how do we create the codecs.StreamReader? That's where the codecs.getreader() function comes into play. We pass the codec of our choice (in this case, utf-8) into codecs.getreader(), which creates the codecs.StreamReader. This allows us to read the CSV file row by row into dictionaries by passing the codecs.StreamReader into csv.DictReader, as in the solution shown at the beginning of the article.
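
To see this mechanism in isolation, here is a small sketch that swaps S3 for an in-memory byte stream (purely illustrative, not from the original article):

import codecs
import csv
import io

# io.BytesIO stands in for the StreamingBody: a binary file-like object.
body = io.BytesIO(b"id,name,age\n1,Jack,24\n2,Stark,29\n")

# codecs.getreader("utf-8") returns the UTF-8 StreamReader class;
# calling it with the binary stream yields a decoded text stream.
text_stream = codecs.getreader("utf-8")(body)

# The decoded stream feeds straight into csv.DictReader.
for row in csv.DictReader(text_stream):
    print(row["name"])  # Jack, then Stark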

Thank you for following this long and detailed (maybe too exhausting) explanation of such a short program. I hope you find it useful. Thank you for listening ❤️.
