How do i import a dataset in python?


When running python programs, we need to use datasets for data analysis. Python has various modules which help us in importing the external data in various file formats to a python program. In this example we will see how to import data of various formats to a python program.

Import csv file

The csv module enables us to read each of the row in the file using a comma as a delimiter. We first open the file in read only mode and then assign the delimiter. Finally use a for loop to read each row from the csv file.

Example

import csv

with open("E:\customers.csv",'r') as custfile:
rows=csv.reader(custfile,delimiter=',')
for r in rows:
print(r)

Output

Running the above code gives us the following result −

['customerID', 'gender', 'Contract', 'PaperlessBilling', 'Churn']
['7590-VHVEG', 'Female', 'Month-to-month', 'Yes', 'No']
['5575-GNVDE', 'Male', 'One year', 'No', 'No']
['3668-QPYBK', 'Male', 'Month-to-month', 'Yes', 'Yes']
['7795-CFOCW', 'Male', 'One year', 'No', 'No']
……
…….

With pandas

The pandas library can actually handle most of the file types inclusing csv file. In this program let see how pandas library handles the excel file using the read_excel module. In the below example we read the excel version of the above file and get the same result when we read the file.

Example

import pandas as pd

df = pd.ExcelFile("E:\customers.xlsx")
data=df.parse("customers")
print(data.head(10))

Output

Running the above code gives us the following result −

  customerID   gender Contract            PaperlessBilling Churn
0 7590-VHVEG Female   Month-to-month         Yes            No
1 5575-GNVDE   Male   One year                No            No
2 3668-QPYBK   Male   Month-to-month         Yes           Yes
3 7795-CFOCW   Male   One year                No            No
4 9237-HQITU Female   Month-to-month        Yes            Yes
5 9305-CDSKC Female   Month-to-month        Yes            Yes
6 1452-KIOVK   Male   Month-to-month        Yes             No
7 6713-OKOMC Female   Month-to-month         No             No
8 7892-POOKP Female   Month-to-month        Yes            Yes
9 6388-TABGU   Male   One year               No             No

With pyodbc

We can also connect to database servers using a module called pyodbc. This will help us import data from relational sources using a sql query. Ofcourse we also have to define the connection details to the db before passing on the query.

Example

import pyodbc
sql_conn = pyodbc.connect("Driver={SQL Server};Server=serverName;UID=UserName;PWD=Password;Database=sqldb;")
data_sql = pd.read_sql_query(SQL QUERY’, sql_conn)
data_sql.head()

Output

Depending the SQL query the result will be displayed.

How do i import a dataset in python?

Updated on 10-Jul-2020 11:10:27

  • Related Questions & Answers
  • Importing data into models in Django
  • Importing data from Pgsql to SAP HANA database
  • Table type while importing data from flat file in SAP HANA
  • Why importing star is a bad idea in python
  • Different mapping options in SAP HANA while importing data from flat file
  • PHP Aliasing/Importing namespaces
  • Importing External Style Sheets in CSS
  • Importing / Exporting CSV file in PowerShell
  • JavaScript Importing and Exporting Modules
  • IMPORTING, EXPORTING and CHANGING Keywords in ABAP
  • How to check if a python module exists without importing it?
  • Importing/Exporting ABAP packages to Presentation server
  • Importing limited number of rows from CSV in SAP HANA
  • Importing an object from ECM application in SAP Business Workplace
  • What is the difference between importing matplotlib and matplotlib.pyplot?

This tutorial explains various methods to read data in Python. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Stata, Rdata (R) etc. Loading data in python environment is the most initial step of analyzing data.

Install and Load pandas Package

pandas is a powerful data analysis package. It makes data exploration and manipulation easy. It has several functions to read data from various sources.
If you are using Anaconda, pandas must be already installed. You need to load the package by using the following command -

import pandas as pd

If pandas package is not installed, you can install it by running the following code in Ipython Console. If you are using Spyder, you can submit the following code in Ipython console within Spyder.

!pip install pandas

If you are using Anaconda, you can try the following line of code to install pandas -

!conda install pandas

1. Import CSV files

It is important to note that a singlebackslash does not work when specifying the file path. You need to either change it to forward slash or add one more backslash like below

import pandas as pd
mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")

If no header (title) in raw data file

mydata1 = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)

You need to include header = None option to tell Python there is no column name (header) in data.

Add Column Names

We can include column names by using names= option.

mydata2 = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])

The variable names can also be added separately by using the following command.

mydata1.columns = ['ID', 'first_name', 'salary']

2. Import File from URL

You don't need to perform additional steps to fetch data from URL. Simply put URL in read_csv() function (applicable only for CSV files stored in URL).

mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

3. Read Text File

We can use read_table() function to pull data from text file. We can also use read_csv() with sep= "\t" to read data from tab-separated file.

mydata = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")

4. Read Excel File

The read_excel() function can be used to import excel data into Python.

mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls",sheetname="Data 1", skiprows=2)

If you do not specify name of sheet in sheetname= option, it would take by default first sheet.

5. Read delimited file

Suppose you need to import a file that is separated with white spaces.

mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", header = None)

To include variable names, use the names= option like below -

mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", names=['a', 'b', 'c', 'd'])

6. Read SAS File

We can import SAS data file by using read_sas() function.

mydata4 = pd.read_sas('cars.sas7bdat')

If you have a large SAS File, you can try package named pyreadstat which is faster than pandas. It is equivalent to haven package in R which provides easy and fast way to read data from SAS, SPSS and Stata. To install this package, you can use the command pip install pyreadstat

import pyreadstat
df, meta = pyreadstat.read_sas7bdat('cars.sas7bdat')

# done! let's see what we got
print(df.head())
print(meta.column_names)
print(meta.column_labels)
print(meta.number_rows)
print(meta.number_columns)

7. Read Stata File

We can load Stata data file via read_stata() function.

mydata41 = pd.read_stata('cars.dta')

pyreadstat package lets you to pull value labels from stata files.

import pyreadstat
df, meta = pyreadstat.read_dta("cars.dta")

To get labels, set apply_value_formats as TRUE

df, meta = pyreadstat.read_dta("cars.dta", apply_value_formats=True)

8. Import R Data File

Using pyreadr package, you can load .RData and .Rds format files which in general contains R data frame. You can install this package using the command below -

pip install pyreadr

With the use of read_r( ) function, we can import R data format files.

import pyreadr
result = pyreadr.read_r('C:/Users/sampledata.RData')
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1

Similarly, you can read .Rds formatted file.

9. Read SQL Table

We can extract table from SQL database (SQL Server / Teradata). See the program below -

SQL Server

You can read data from tables stored in SQL Server by building a connection. You need to have server, User ID (UID), database details to establish connection.

import pandas as pd
import pyodbc 
conn = pyodbc.connect("Driver={SQL Server};Server=serverName;UID=UserName;PWD=Password;Database=RCO_DW;")
df = pd.read_sql_query('select * from dbo.Table WHERE ID > 10', conn)
df.head()

Teradata

You need to import Teradata module which makes python easily integrated with Teradata Database.

import pandas as pd
import teradata
udaExec = teradata.UdaExec(appName="HelloWorld", version="1.0",
                           logConsole=False)
session = udaExec.connect(method="odbc",
                          USEREGIONALSETTINGS="N",
                          system="tdprod",
                          username="xxx",
                          password="xxx");

query = "SELECT * FROM flight"
df = pd.read_sql(query , session)

Explanation

  • UdaExec provides DevOps support features such as configuration and logging.
  • You can assign any name and version in appName and version
  • logConsole=False tells Python not to log to the console.
  • system="tdprod" refers to name of the system we are connecting using ODBC as the connection method
  • USEREGIONALSETTINGS="N" is used to ensure that float values can be loaded and make decimal separator be ‘.’

Suppose you have .db extension file which is a database file and you want to extract data from it.

import sqlite3
from pandas.io import sql
conn = sqlite3.connect('C:/Users/Deepanshu/Downloads/flight.db')
query = "SELECT * FROM flight"
results = pd.read_sql(query, con=conn)
print results.head()

10. Import Data from SPSS File

import pyreadstat
df, meta = pyreadstat.read_sav("file.sav", apply_value_formats=True)

If you don't want value labels, make apply_value_formats as False.

11. Read sample of rows and columns

By specifying nrows= and usecols=, you can fetch specified number of rows and columns.

mydata7 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))

nrows = 5 implies you want to import only first 5 rows and usecols= refers to specified columns you want to import.

12. Skip rows while importing

Suppose you want to skip first 5 rows and wants to read data from 6th row (6th row would be a header row)

mydata8 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)

13. Specify values as missing values

By including na_values= option, you can specify values as missing values. In this case, we are telling python to consider dot (.) as missing cases.

mydata9 = pd.read_csv("workingfile.csv", na_values=['.'])

How do I import my own dataset in Python?

5 Different Ways to Load Data in Python.
Manual function..
loadtxt function..
genfromtxt function..
read_csv function..
Pickle..

How do I import a dataset?

Navigate to the "My Datasets" page by clicking on the corresponding link in the left side of the Public Data Explorer site. Click on the "Choose file" button, and select the location of the zipped, DSPL bundle on your local system. Click the "Upload dataset" button to begin the import process.

How do I import a CSV dataset into Python?

Read A CSV File Using Python.
Using the CSV Library. import csv with open("./bwq.csv", 'r') as file: csvreader = csv.reader(file) for row in csvreader: print(row) Here we are importing the csv library in order to use the . ... .
Using the Pandas Library. import pandas as pd data = pd.read_csv("bwq.csv") data..

How do I import a dataset to pandas?

Pandas Read CSV.
Load the CSV into a DataFrame: import pandas as pd. df = pd.read_csv('data.csv') ... .
Print the DataFrame without the to_string() method: import pandas as pd. ... .
Check the number of maximum returned rows: import pandas as pd. ... .
Increase the maximum number of rows to display the entire DataFrame: import pandas as pd..