Get Data from HTML with Python
Introduction

Nowadays everyone is talking about data and how it helps uncover hidden patterns and new insights. The right set of data can help a business improve its marketing strategy and increase its overall sales. And let's not forget the popular example of a politician learning the public's opinion before an election. Data is powerful, but it does not come for free. Gathering the right data is always expensive; think of surveys or marketing campaigns.
The internet is a pool of data, and with the right set of skills one can use this data to gain plenty of new information. You could always copy and paste the data into an Excel or CSV file, but that is time-consuming and expensive. So why not have a software developer get the data into a readable format by writing some jibber-jabber? Yes, it is possible to extract data from the web, and this "jibber-jabber" is called web scraping: according to Wikipedia, it is the extraction of data from websites.
BeautifulSoup is a popular Python library for scraping data from the web. To get the best out of it, you only need a basic knowledge of HTML, which is covered in this guide.

Components of a Webpage

If you already know basic HTML, you can skip this part. A webpage is built from nested HTML tags, each describing one piece of the page's structure.
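For reference, a minimal HTML page has this shape (the tag names are standard; the text content is just a placeholder):

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
```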
Every tag in HTML can have attribute information (i.e., class, id, href, and other useful information) that helps identify the element uniquely. For more information about basic HTML tags, check out w3schools.

Steps for Scraping Any Website

To scrape a website using Python, you need to perform these four basic steps:

1. Inspect the webpage to understand its structure and find the elements that hold the data you need.
2. Send an HTTP GET request to the URL of the webpage you want to scrape; the server responds with the page's HTML content.
3. Fetch and parse the HTML using Beautiful Soup and keep the data in a Python data structure such as a dict or list.
4. Write the extracted data to an output file, such as a CSV file.
Understanding and Inspecting the Data

Now that you know basic HTML and its tags, you first need to inspect the page you want to scrape. Inspection is the most important job in web scraping; without knowing the structure of the webpage, it is very hard to get the needed information. To help with inspection, every browser, such as Google Chrome or Mozilla Firefox, comes with a handy feature called developer tools. In this guide, we will be working with Wikipedia to scrape some table data from the page List of countries by GDP (nominal). This page contains a Lists heading with three tables of countries sorted by rank and GDP value as reported by the "International Monetary Fund", "World Bank", and "United Nations". Note that these three tables are enclosed in an outer table. To learn about any element you wish to scrape, just right-click on it, choose Inspect, and examine the element's tags and attributes.

Jump into the Code

In this guide, we will learn how to do simple web scraping using Python and BeautifulSoup.

Install the Essential Python Libraries
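Assuming the standard package names on PyPI (requests for HTTP, beautifulsoup4 for the bs4 module), the two libraries can be installed with:

```shell
pip3 install requests beautifulsoup4
```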
Note: If you are using Windows, use pip instead of pip3.

Importing the Essential Libraries

Import the requests library to fetch the page content and bs4 (Beautiful Soup) to parse the HTML page content.
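The two imports described above look like this:

```python
# requests fetches pages over HTTP; bs4 provides the BeautifulSoup parser class
import requests
from bs4 import BeautifulSoup
```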
Collecting and Parsing a Webpage

In the next step, we make a GET request to the URL and create a parse tree object (soup) with the help of BeautifulSoup and the "lxml" parser. Note that lxml is a third-party package rather than a Python built-in; the standard library alternative is "html.parser".
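A minimal sketch of this step, assuming the Wikipedia URL discussed above and the built-in "html.parser" (swap in "lxml" if you have installed it):

```python
import requests
from bs4 import BeautifulSoup

# URL of the page discussed in this guide
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Make a GET request; response.text holds the page HTML
response = requests.get(url)

# Build the parse tree. "html.parser" ships with Python; pass "lxml"
# instead if you have installed the third-party lxml package.
soup = BeautifulSoup(response.text, "html.parser")
```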
With our BeautifulSoup object, soup, we can move on and collect the required table data. Before getting to the actual code, let's first play with the soup object and print some basic information from it.

Example 1:

Let's first print the title of the webpage.
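A sketch, repeating the fetch from the previous step so the snippet is self-contained:

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# soup.title is the whole <title> element, tags included,
# e.g. <title>List of countries by GDP (nominal) - Wikipedia</title>
print(soup.title)
```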
To get the text without the HTML tags, we just use .text:
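For example, under the same assumptions as above:

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# .text drops the tags and returns only the inner text of the element
print(soup.title.text)
```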
Example 2:

Now let's get all the links on the page along with their attributes, such as href, title, and inner text.
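A sketch, again assuming the Wikipedia URL from the guide:

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# find_all("a") returns every anchor element in the document;
# .get returns None when an attribute is absent rather than raising
for link in soup.find_all("a"):
    print(link.get("href"), link.get("title"), link.text)
```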
This will output all the available links from the page along with the attributes mentioned above.

Now let's get back on track and find our goal table. Analyzing the outer table, we can see that it has special attributes, including a class of wikitable, and two tr tags inside tbody. If you expand the first tr tag, you will find that it holds the headings of all three tables, and the second tr tag holds the table data for all three inner tables. Let's first get all three table headings:
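Because Wikipedia's live markup changes over time, the sketch below runs against a small inline sample that mirrors the structure just described; the same calls apply to the soup object built from the live page:

```python
from bs4 import BeautifulSoup

# Inline sample of the outer table's first row (structure as described above;
# the heading strings are illustrative)
html = """
<table class="wikitable">
  <tr>
    <td>Per the International Monetary Fund</td>
    <td>Per the World Bank</td>
    <td>Per the United Nations</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

outer_table = soup.find("table", class_="wikitable")
first_row = outer_table.find("tr")  # the first tr holds the three headings
headings = [td.text.strip() for td in first_row.find_all("td")]
print(headings)
```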
This will give the three table headings as output.
Moving on to the second tr tag of the outer table, let's get the content of all three tables by iterating over each table and its rows.
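A sketch of that iteration, again using a small inline sample with the described structure (country names and figures are illustrative):

```python
from bs4 import BeautifulSoup

# Outer table: first tr holds headings, second tr holds the three inner tables
html = """
<table class="wikitable">
  <tr>
    <td>Per the International Monetary Fund</td>
    <td>Per the World Bank</td>
    <td>Per the United Nations</td>
  </tr>
  <tr>
    <td><table>
      <tr><th>Rank</th><th>Country</th><th>GDP (US$ million)</th></tr>
      <tr><td>1</td><td>United States</td><td>21,439,453</td></tr>
    </table></td>
    <td><table>
      <tr><th>Rank</th><th>Country</th><th>GDP (US$ million)</th></tr>
      <tr><td>1</td><td>United States</td><td>19,390,604</td></tr>
    </table></td>
    <td><table>
      <tr><th>Rank</th><th>Country</th><th>GDP (US$ million)</th></tr>
      <tr><td>1</td><td>United States</td><td>19,390,604</td></tr>
    </table></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False keeps only the outer table's own rows, not the inner ones
outer_rows = soup.find("table", class_="wikitable").find_all("tr", recursive=False)

data = []
# The second outer tr holds the three inner tables
for inner_table in outer_rows[1].find_all("table"):
    rows = []
    for tr in inner_table.find_all("tr"):
        # Collect both header (th) and data (td) cells of each row
        rows.append([cell.text.strip() for cell in tr.find_all(["th", "td"])])
    data.append(rows)
print(data)
```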
Writing Data to CSV

Now that we have created our data structure, we can export it to a CSV file by iterating over it.
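A sketch of the export, assuming the rows were collected into a dict keyed by table heading (the heading, country names, and figures below are illustrative):

```python
import csv

# Hypothetical data structure from the previous step: one list of rows
# per source, keyed by the table heading
data = {
    "Per the International Monetary Fund": [
        ["1", "United States", "21,439,453"],
        ["2", "China", "14,140,163"],
    ],
}

for heading, rows in data.items():
    # One CSV file per inner table; derive the filename from the heading
    filename = heading.replace(" ", "_") + ".csv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Rank", "Country", "GDP (US$ million)"])
        writer.writerows(rows)
```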
Putting It Together

Let's join all the above code snippets. Our complete code looks like this:
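A sketch of a complete script under the same assumptions as above. Since the page's outer-table layout may have changed since the guide was written, this version simply dumps the first wikitable-classed table it finds to a CSV file:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Grab the first table with class "wikitable" rather than relying on the
# exact nested layout described earlier, which Wikipedia may have changed
table = soup.find("table", class_="wikitable")

with open("gdp.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for tr in table.find_all("tr"):
        cells = [cell.text.strip() for cell in tr.find_all(["th", "td"])]
        if cells:  # skip structural rows with no cells
            writer.writerow(cells)
```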
BEWARE: Scraping Rules

Now that you have a basic idea about scraping with Python, it is important to understand the legality of web scraping before scraping a website. Generally, if you use the scraped data for personal use and do not plan to republish it, it may not cause any problems. Read the Terms of Use, the Conditions of Use, and the robots.txt file before scraping a website. You must follow the robots.txt rules when scraping; otherwise, the website owner has every right to take legal action against you.

Conclusion

This guide went through the process of scraping a Wikipedia page using Python 3 and Beautiful Soup, and finally exporting the data to a CSV file. We have learned how to scrape a basic website and fetch all the useful data in just a couple of minutes. You can continue to expand the art of scraping by moving on to new websites.
Beautiful Soup is simple and works well for small-scale web scraping. If you want to scrape webpages on a large scale, you can consider more advanced tools such as Scrapy and Selenium. I hope you liked this guide. If you have any queries regarding this topic, feel free to contact me at CodeAlphabet.