Get Data from HTML with Python

Introduction

Nowadays everyone is talking about data and how it helps uncover hidden patterns and new insights. The right set of data can help a business improve its marketing strategy, which can increase overall sales. And let's not forget the popular example of a politician learning the public's opinion before elections. Data is powerful, but it does not come for free. Gathering the right data is always expensive; think of surveys or marketing campaigns, etc.

The internet is a pool of data, and, with the right set of skills, one can use this data to gain a lot of new information. You can always copy-paste the data into your Excel or CSV file, but that is also time-consuming and expensive. Why not hire a software developer who can get the data into a readable format by writing some jibber-jabber? Yes, it is possible to extract data from the web, and this "jibber-jabber" is called web scraping.

According to Wikipedia, Web Scraping is:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites

Beautiful Soup is a popular Python library for scraping data from the web. To get the best out of it, you need only a basic knowledge of HTML, which is covered in this guide.

Components of a Webpage

If you already know basic HTML, you can skip this part.

The basic syntax of any webpage is:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    </head>
    <body>
        <h2 class="heading">My first Web Scraping with Beautiful Soup</h2>
        <p>Let's scrape the website using Python.</p>
    </body>
</html>

html

Every tag in HTML can have attribute information (i.e., class, id, href, and other useful information) that helps in identifying the element uniquely.

For more information about basic HTML tags, check out w3schools.
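To see how such attributes help in locating an element, here is a minimal sketch using the sample page above and Beautiful Soup (the library itself is introduced later in this guide):

from bs4 import BeautifulSoup

html_doc = """
<h2 class="heading">My first Web Scraping with Beautiful Soup</h2>
<p>Let's scrape the website using Python.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# the class attribute identifies the h2 element uniquely
heading = soup.find("h2", attrs={"class": "heading"})
print(heading.text.strip())  # My first Web Scraping with Beautiful Soup

python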

Steps for Scraping Any Website

To scrape a website using Python, you need to perform these four basic steps (a minimal sketch of all four steps follows the list):

  • Sending an HTTP GET request to the URL of the webpage that you want to scrape, which will respond with HTML content. We can do this by using the Requests library of Python.

  • Fetching and parsing the data using Beautiful Soup and maintaining the data in some data structure such as a Dict or List.

  • Analyzing the HTML tags and their attributes, such as class, id, and other HTML tag attributes, and identifying the HTML tags where your content lives.

  • Outputting the data in any file format such as CSV, XLSX, JSON, etc.
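A minimal end-to-end sketch of these four steps could look like this (https://example.com and the CSV layout are placeholders; the rest of this guide works through a real page):

import csv

import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP GET request to the page URL (placeholder URL)
html_content = requests.get("https://example.com").text

# Step 2: parse the HTML content into a soup object
soup = BeautifulSoup(html_content, "html.parser")

# Step 3: analyze the tags and collect the content
# (here we simply grab every paragraph's text)
paragraphs = [p.text.strip() for p in soup.find_all("p")]

# Step 4: output the data to a file format such as CSV
with open("output.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    for text in paragraphs:
        writer.writerow([text])

python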

Understanding and Inspecting the Data

Now that you know basic HTML and its tags, you first need to inspect the page you want to scrape. Inspection is the most important job in web scraping; without knowing the structure of the webpage, it is very hard to get the needed information. To help with inspection, every browser, such as Google Chrome or Mozilla Firefox, comes with a handy tool called developer tools.

In this guide, we will be working with Wikipedia to scrape some table data from the page List of countries by GDP (nominal). This page contains a Lists heading with three tables of countries sorted by their rank and GDP value, as reported by the International Monetary Fund, the World Bank, and the United Nations. Note that these three tables are enclosed in an outer table.

To know about any element that you wish to scrape, just right-click on that text and examine the tags and attributes of the element.


Jump into the Code

In this guide, we will be learning how to do a simple web scraping using Python and BeautifulSoup.

Install the Essential Python Libraries

pip3 install requests beautifulsoup4

shell

Note: If you are using Windows, use pip instead of pip3.
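To confirm that the installation worked, you can import both packages and print their versions (the exact numbers will vary on your machine):

import requests
import bs4

print(requests.__version__)
print(bs4.__version__)

python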

Importing the Essential Libraries

Import the "requests" library to fetch the page content and bs4 (Beautiful Soup) for parsing the HTML page content.

from bs4 import BeautifulSoup
import requests

python

Collecting and Parsing a Webpage

In the next step, we will make a GET request to the URL and create a parse tree object (soup) with the help of BeautifulSoup and the "lxml" parser. Note that lxml is a third-party parser, so install it first with pip3 install lxml (or swap in Python's built-in "html.parser").

# importing the libraries
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify()) # print the parsed data of html

python
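In practice, it is also worth checking that the request actually succeeded before parsing. Here is a defensive variation of the snippet above (the 10-second timeout is an arbitrary choice):

from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# fail fast on slow connections and on 4xx/5xx responses
response = requests.get(url, timeout=10)
response.raise_for_status()  # raises requests.HTTPError on a bad status code

soup = BeautifulSoup(response.text, "lxml")

python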

With our BeautifulSoup object, i.e., soup, we can move on and collect the required table data.

Before going to the actual code, let's first play with the soup object and print some basic information from it:

Example 1:

Let's just first print the title of the webpage:

print(soup.title)

python

It will give an output as follows:

<title>List of countries by GDP (nominal) - Wikipedia</title>

To get the text without the HTML tags, we just use .text:

print(soup.title.text)

python

List of countries by GDP (nominal) - Wikipedia

Example 2:

Now, let's get all the links in the page along with their attributes, such as href and title, and their inner text.

for link in soup.find_all("a"):
    print("Inner Text: {}".format(link.text))
    print("Title: {}".format(link.get("title")))
    print("href: {}".format(link.get("href")))

python

This will output all the available links from the page along with the mentioned attributes.
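As a small extension, you can filter these links; for example, keeping only internal Wikipedia article links (the startswith check is just one simple heuristic):

for link in soup.find_all("a"):
    href = link.get("href")
    # internal article links look like /wiki/Gross_domestic_product
    if href and href.startswith("/wiki/"):
        print("https://en.wikipedia.org" + href)

python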

Now, let's get back on track and find our goal table.

Analyzing the outer table, we can see that it has special attributes, including a class of wikitable, and that it has two tr tags inside tbody.


If you expand the tr tags, you will find that the first tr tag holds the headings of all three tables and the next tr tag holds the table data for all three inner tables.

Let's first get all three table headings:

Note that we are removing the newlines and extra spaces from the left and right of the text by using simple string methods available in Python.

gdp_table = soup.find("table", attrs={"class": "wikitable"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)

python

This will give an output as:

['Per the International Monetary Fund (2018)', 'Per the World Bank (2017)', 'Per the United Nations (2017)']

Moving on to the second tr tag of the outer table, let's get the contents of all three tables by iterating over each table and its rows.


data = {}
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())
    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip it with t_headers
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Store the table data under its heading.
    data[heading] = table_data

print(data)

python
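Before exporting, it is worth sanity-checking the structure by printing a small sample, e.g., the first few rows of the first table (the exact values depend on the live page; the very first row may print as an empty dict because that tr holds only header cells):

first_heading = headings[0]
for row in data[first_heading][:3]:
    print(row)

python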

Writing Data to CSV

Now that we have created our data structure, we can export it to a CSV file by just iterating over it.

import csv

for topic, table in data.items():
    # Create a csv file for each table
    # newline='' prevents blank lines between rows on Windows
    with open(f"{topic}.csv", 'w', newline='') as out_file:
        # Each of the 3 tables has the following headers
        headers = [
            "Country/Territory",
            "GDP(US$million)",
            "Rank"
        ] # == t_headers
        writer = csv.DictWriter(out_file, headers)
        # write the header
        writer.writeheader()
        for row in table:
            if row:
                writer.writerow(row)

python
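If you prefer JSON over CSV, the same data dictionary can be dumped in one go with the standard library (the file name data.json is just an example):

import json

with open("data.json", "w") as out_file:
    # indent=4 makes the output human-readable
    json.dump(data, out_file, indent=4)

python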


Putting It Together

Let's join all the above code snippets.

Our complete code looks like this:

# importing the libraries
from bs4 import BeautifulSoup
import requests
import csv


# Step 1: Sending an HTTP request to a URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text


# Step 2: Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html


# Step 3: Analyze the HTML tags where your content lives
# Create a data dictionary to store the data.
data = {}
# Get the table having the class wikitable
gdp_table = soup.find("table", attrs={"class": "wikitable"})
gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

# Get all the 3 tables contained in "gdp_table"
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())

    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

        # find all td's(3) in tr and zip it with t_headers
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Store the table data under its heading.
    data[heading] = table_data


# Step 4: Export the data to csv
"""
For this example let's create 3 separate csv files for
the 3 tables respectively
"""
for topic, table in data.items():
    # Create a csv file for each table
    # newline='' prevents blank lines between rows on Windows
    with open(f"{topic}.csv", 'w', newline='') as out_file:
        # Each of the 3 tables has the following headers
        headers = [
            "Country/Territory",
            "GDP(US$million)",
            "Rank"
        ] # == t_headers
        writer = csv.DictWriter(out_file, headers)
        # write the header
        writer.writeheader()
        for row in table:
            if row:
                writer.writerow(row)

python

Beware: Scraping Rules

Now that you have a basic idea of scraping with Python, it is important to know the legality of web scraping before you start scraping a website. Generally, if you are using scraped data for personal use and do not plan to republish it, it may not cause any problems. Read the Terms of Use or Conditions of Use, and also the robots.txt, before scraping a website. You must follow the robots.txt rules; otherwise, the website owner has every right to take legal action against you.
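Python's standard library can check the robots.txt rules for you. Here is a minimal sketch using urllib.robotparser (passing "*" asks about the rules that apply to any user agent):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
print(rp.can_fetch("*", url))  # True if this URL may be crawled

python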

Conclusion

This guide walked through the process of scraping a Wikipedia page using Python 3 and Beautiful Soup and exporting the result to a CSV file. We have learned how to scrape a basic website and fetch all the useful data in just a couple of minutes.

You can continue to expand the awesomeness of the art of scraping by moving on to new websites. Some good examples of data to scrape are:

  • Customer reviews and product pages

Beautiful Soup is simple and sufficient for small-scale web scraping. If you want to scrape webpages on a large scale, you can consider more advanced tools like Scrapy and Selenium.


Hope you liked this guide. If you have any queries regarding this topic, feel free to contact me at CodeAlphabet.
