Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
The Internet hosts perhaps the greatest source of information—and misinformation—on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites. In this tutorial, you’ll learn how to:

- Parse website data using string methods and regular expressions
- Parse website data using an HTML parser
- Interact with forms and other website components
Scrape and Parse Text From Websites

Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:

- The site has a good reason to protect its data and doesn’t want it copied wholesale.
- Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server.
Let’s start by grabbing all the HTML code from a single web page. You’ll use a page on Real Python that’s been set up for use with this tutorial.

Your First Web Scraper

One useful package for web scraping that you can find in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the urllib.request module contains a function called urlopen() that you can use to open a URL within a program. In IDLE’s interactive window, type the following to import urlopen():
The web page that you’ll open is one of the sample pages set up for this tutorial. To open the web page, pass its URL to urlopen(), which returns an HTTPResponse object. To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes into a string using UTF-8. Now you can print the HTML to see the contents of the web page.
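The original interactive session was lost from this copy, so here’s a sketch of the whole sequence. Since the tutorial’s sample URL isn’t preserved here, the sketch uses a self-contained data: URL (which urlopen() also accepts) to stand in for the profile page; substitute the real page’s URL when following along.

```python
from urllib.request import urlopen

# A data: URL standing in for the tutorial's sample profile page;
# replace it with the real page's URL when following along.
url = "data:text/html,<html><head><title>Aphrodite</title></head></html>"

page = urlopen(url)                # open the URL; returns a response object
html_bytes = page.read()           # read the raw bytes of the page
html = html_bytes.decode("utf-8")  # decode the bytes into a string
print(html)
```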
Once you have the HTML as text, you can extract information from it in a couple of different ways.

A Primer on Regular Expressions

Regular expressions—or regexes for short—are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s re module. To work with regular expressions, the first thing you need to do is import the re module. Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more instances of whatever comes just before the asterisk. In the following example, you use re.findall() to find any text within a string that matches a given regular expression. The first argument of re.findall() is the regular expression that you want to match, and the second argument is the string to test. re.findall() returns a list of all matches. Here’s the same pattern applied to different strings:
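The original examples were lost from this copy, so here’s an illustrative session with the hypothetical pattern "ab*c", which matches any text that starts with "a", ends with "c", and has zero or more instances of "b" between the two:

```python
import re

# "ab*c" matches an "a", zero or more "b"s, then a "c"
print(re.findall("ab*c", "ac"))      # ['ac']: a match with zero "b"s
print(re.findall("ab*c", "abcd"))    # ['abc']: a match inside a longer string
print(re.findall("ab*c", "acc ac"))  # ['ac', 'ac']: every match is returned
print(re.findall("ab*c", "abdc"))    # []: "d" interrupts the pattern
```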
Notice that if no match is found, then re.findall() returns an empty list. Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value re.IGNORECASE:
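Continuing with the same illustrative "ab*c" pattern:

```python
import re

print(re.findall("ab*c", "ABC"))                 # []: matching is case sensitive
print(re.findall("ab*c", "ABC", re.IGNORECASE))  # ['ABC']: case is ignored
```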
You can use a period (.) to stand for any single character in a regular expression. For instance, you could match strings that contain the letters "a" and "c" separated by exactly one other character. The pattern .* stands for any character repeated any number of times, which lets you match "a" and "c" separated by arbitrary text:
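An illustrative session using the hypothetical patterns "a.c" and "a.*c":

```python
import re

# "." matches exactly one character between "a" and "c"
print(re.findall("a.c", "abc"))    # ['abc']
print(re.findall("a.c", "abbc"))   # []: two characters between "a" and "c"

# ".*" matches any number of characters between "a" and "c"
print(re.findall("a.*c", "abbc"))  # ['abbc']
print(re.findall("a.*c", "ac"))    # ['ac']
```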
Often, you use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because it returns a match object rather than a list of strings. The details of the match object are beyond the scope of this tutorial, but one useful thing it can do is return the matched text through its .group() method:
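For example, again using the illustrative "ab*c" pattern:

```python
import re

match = re.search("ab*c", "Everyone loves abc!")
print(match)          # a re.Match object describing the first match
print(match.group())  # 'abc': the matched text itself
```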
There’s one more function in the re module that’s useful for parsing out text: re.sub(), which is short for substitute, replaces text in a string that matches a regular expression with new text. The arguments passed to re.sub() are the regular expression, followed by the replacement text, followed by the string:
Perhaps that wasn’t quite what you expected to happen.
Alternatively, you can use the non-greedy matching pattern *?, which works the same way as * except that it matches the shortest possible string of text:
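The original examples were lost from this copy; here’s an illustrative comparison in the same spirit (the strings and replacement text are stand-ins). The pattern <.*> matches greedily, from the first "<" to the last ">", while <.*?> matches each tag individually:

```python
import re

string = "Everything is <replaced> if it's in <tags>."

# Greedy: "<.*>" swallows everything from the first "<" to the last ">"
print(re.sub("<.*>", "ELEPHANTS", string))
# Everything is ELEPHANTS.

# Non-greedy: "<.*?>" matches the shortest possible string each time
print(re.sub("<.*?>", "ELEPHANTS", string))
# Everything is ELEPHANTS if it's in ELEPHANTS.
```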
This time, re.sub() finds the shortest possible match and leaves the rest of the string intact.

Check Your Understanding

Expand the block below to check your understanding. Write a program that grabs the full HTML from the following URL:
Then use the re module to parse out the requested information from the page. You can expand the block below to see a solution. First, import the urlopen() function and the re module. Then open the URL and use the .read() and .decode() methods to extract the page’s full HTML as a string.
Now that you have the HTML source of the web page as a string assigned to a variable, you can extract the requested information with regular expressions. You can get each value by finding the text that labels it in the HTML and matching the text that follows. The following loop does just that.
It looks like there’s a lot going on in this regular expression, but it’s built from the same metacharacters and patterns covered earlier in this section.
At the end of the loop, you use string methods to tidy up each match before printing it. This solution is one of many that solves this problem, so if you got the same output with a different solution, then you did great! When you’re ready, you can move on to the next section.

Use an HTML Parser for Web Scraping in Python

Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.

Install Beautiful Soup

To install Beautiful Soup, you can run the following in your terminal:

python -m pip install beautifulsoup4
Run python -m pip show beautifulsoup4 to view some details about the installed package. In particular, notice that the latest version at the time of writing was 4.9.1.

Create a BeautifulSoup Object

Type the following program into a new editor window:
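The original listing was lost from this copy. The sketch below reconstructs its likely shape, assuming it follows the urlopen() pattern from earlier in the tutorial, and again uses a data: URL in place of the sample profile page so it runs as-is:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Stand-in for the sample profile page's URL
url = "data:text/html,<html><head><title>Dionysus</title></head><body>Hello</body></html>"

page = urlopen(url)                        # open the URL
html = page.read().decode("utf-8")         # read and decode the HTML
soup = BeautifulSoup(html, "html.parser")  # parse it into a BeautifulSoup object
```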
This program does three things:

- Opens the URL using urlopen() from the urllib.request module
- Reads and decodes the page’s HTML as a string
- Creates a BeautifulSoup object from the HTML and assigns it to a variable
The BeautifulSoup object represents the parsed HTML document.

Use a BeautifulSoup Object

Save and run the above program. When it’s finished running, you can use the variable holding the BeautifulSoup object in the interactive window to parse the content of the HTML in various ways. For example, BeautifulSoup objects have a .get_text() method that extracts all the text from the document and automatically removes any HTML tags. Type the following code into IDLE’s interactive window:
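A self-contained sketch of .get_text(), with inline HTML standing in for the downloaded profile page (the profile details are placeholders):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded profile page
html = """<html>
<head><title>Profile: Dionysus</title></head>
<body>
<h2>Name: Dionysus</h2>
<br><br>
Favorite Color: Wine
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())  # all the text, with the HTML tags stripped out
```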
There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string .replace() method if you need to. Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions. However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of <img> HTML tags. You can use the .find_all() method to locate every instance of a particular tag, which returns a list of all <img> tags in the HTML document. To get the source of the images in the Dionysus profile page, you access the src attribute using dictionary-style subscript notation:
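A self-contained sketch of the technique, with inline HTML standing in for the profile page (the image paths are placeholders for whatever the real page uses):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the profile page
html = """<html><body>
<img src="/static/dionysus.jpg"/>
<img src="/static/grapes.png"/>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

images = soup.find_all("img")  # every <img> tag in the document
for image in images:
    print(image["src"])        # the src attribute, via subscript notation
```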
Certain tags in HTML documents can be accessed by properties of the BeautifulSoup object. For example, you can get the document’s <title> tag with the .title property. If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you’ll notice that the <title> tag as written in the document contains some extra whitespace in the opening tag and a stray forward slash in the closing tag. Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag. You can also retrieve just the string between the title tags with the tag’s .string property:
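A minimal sketch of tag properties and .string, using inline HTML in place of the real page:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Profile: Dionysus</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # <title>Profile: Dionysus</title>
print(soup.title.string)  # Profile: Dionysus
```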
One of the more useful features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a particular src value, then you can pass that attribute as a keyword argument to .find_all(). This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from the example. If you spend some time browsing various websites and viewing their page sources, then you’ll notice that many websites have extremely complicated HTML structures. When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need. Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag that you’re interested in and extract its data.

In some cases, you may find that Beautiful Soup doesn’t offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you’re comfortable using Beautiful Soup.

Beautiful Soup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then Beautiful Soup alone won’t get you very far.

Check Your Understanding

Expand the block below to check your understanding. Write a program that grabs the full HTML from the profile page used in this section. Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag. The final output should look like this:
You can expand the block below to see a solution. First, import the urlopen() function and the BeautifulSoup class. Each link URL on the page is a relative URL, so you can build a full URL by concatenating the site’s base URL with each relative URL. Now open the page, then read and decode its HTML. With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse it.
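A self-contained sketch of the solution’s core logic, with inline HTML standing in for the downloaded page (the link paths and base URL are placeholders; in the real exercise you’d download the page with urlopen() first):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page
html = """<html><body>
<a href="/profiles/poseidon">Poseidon</a>
<a href="/profiles/dionysus">Dionysus</a>
</body></html>"""

base_url = "http://example.com"  # placeholder base URL

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    print(base_url + link["href"])  # join the base URL with each relative URL
```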
The relative URL for each link can be accessed through the "href" attribute, which you concatenate with the base URL before printing. When you’re ready, you can move on to the next section.

Interact With HTML Forms

The urllib module you’ve used so far is well suited for requesting the contents of a web page, but sometimes you need to submit a form or otherwise interact with a page to get at the content you want. The Python standard library doesn’t provide a built-in means for working with web pages interactively, but many third-party packages are available from PyPI. Among these, MechanicalSoup is a popular and relatively straightforward package to use. In essence, MechanicalSoup installs what’s known as a headless browser, which is a web browser with no graphical user interface. This browser is controlled programmatically via a Python program.

Install MechanicalSoup

You can install MechanicalSoup with pip by running the following in your terminal:

python -m pip install MechanicalSoup
You can now view some details about the package with python -m pip show mechanicalsoup.
In particular, notice that the latest version at the time of writing was 0.12.0. You’ll need to close and restart your IDLE session for MechanicalSoup to load and be recognized after it’s been installed.

Create a Browser Object

Type the following into IDLE’s interactive window:
The number 200 in the output is the HTTP status code returned by the request; a status code of 200 means the request was successful. MechanicalSoup uses Beautiful Soup to parse the HTML from the request: the response object has a .soup attribute holding a BeautifulSoup object for the page. You can view the HTML by inspecting the .soup attribute:
Notice this page has a form on it for entering a username and a password.

Submit a Form With MechanicalSoup

Open the login page in your browser and take a look at its HTML.
However, if you provide the correct login credentials, then you’re redirected to a new page. In the next example, you’ll see how to use MechanicalSoup to fill out and submit this form using Python! The important section of HTML code is the login form—that is, everything inside the <form> tags. Now that you know the underlying structure of the login form, as well as the credentials needed to log in, let’s take a look at a program that fills the form out and submits it. In a new editor window, type in the following program:
Save the file and press F5 to run it. You can confirm that you successfully logged in by typing the following into the interactive window:
Let’s break down the above example:
In the interactive window, you confirm that the submission successfully redirected to the profiles page. Now that you have the profiles page loaded, you can scrape it for links. To do this, you use the page’s .soup attribute together with the Beautiful Soup methods you learned earlier:
Now you can iterate over each link and print its href attribute. The URLs contained in each href attribute are relative URLs. To build full URLs that you can navigate to later, you can concatenate each relative URL with the site’s base URL. In this case, the base URL is just the scheme and host portion of the address you started from:
You can do a lot with just these few tools.

Check Your Understanding

Expand the block below to check your understanding. Use MechanicalSoup to provide the correct username and password to the login form from this section and submit it. Once the form is submitted, display the title of the current page to determine that you’ve been redirected to the profiles page.
Your program should print the text of the page’s title. You can expand the block below to see a solution. First, import the mechanicalsoup package and create a Browser object.
Point the browser to the login page by passing its URL to the browser’s .get() method. Then select the login form and fill in the username and password inputs.
Now that the form is filled out, you can submit it with the browser’s .submit() method.
If you filled the form with the correct username and password, then the program prints the title of the page you were redirected to.
You should see the following text displayed:
If instead you see text indicating that the login failed, then double-check that you submitted the correct username and password. When you’re ready, you can move on to the next section.

Interact With Websites in Real Time

Sometimes you want to be able to fetch real-time data from a website that offers continually updated information. In the dark days before you learned Python programming, you had to sit in front of a browser, clicking the Refresh button to reload the page each time you wanted to check if updated content was available. But now
you can automate this process using the web scraping tools you’ve learned. Open your browser of choice and navigate to the dice-rolling page that’s been set up for this part of the tutorial. The first thing you need to do is determine which element on the page contains the result of the die roll. Do this now by right-clicking anywhere on the page and selecting View page source. A little more than halfway down the HTML code is an element whose id attribute identifies the die roll result. The text of that element is the value that you want to scrape. Let’s start by writing a simple program that opens the page, selects that element by its id, and prints its text. To periodically get a new result, you’ll need to create a loop that loads the page at each step. So everything below the line of your code that creates the browser object needs to go in the body of the loop. For this example, let’s get four rolls of the dice at ten-second intervals. To do that, the last line of your code needs to tell Python to pause running for ten seconds. You can do this with the sleep() function from Python’s time module. Here’s an example that illustrates how sleep() works:
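A minimal sketch of sleep() in action; the message text is arbitrary, and the elapsed time is measured just to show that execution really pauses:

```python
import time

start = time.monotonic()
print("I'm about to wait for two seconds...")
time.sleep(2)  # pause program execution for two seconds
elapsed = time.monotonic() - start
print(f"Done waiting! About {elapsed:.1f} seconds passed.")
```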
When you run this code, you’ll see that the second message isn’t displayed until roughly the number of seconds passed to sleep() has elapsed. For the die roll example, you’ll need to pass the number 10 to sleep() so that the program pauses for ten seconds between requests.
When you run the program, you’ll immediately see the first result printed to the console. After ten seconds, the second result is displayed, then the third, and finally the fourth. What happens after the fourth result is printed? The program continues running for another ten seconds before it finally stops! Well, of course it does—that’s what you told it to do! But it’s kind of a waste of time. You can stop it from doing this by using an if statement so that the program sleeps only between requests, not after the final one:
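A sketch of the polling loop’s structure. A stub function stands in for the scraping step so the sketch runs offline, and the sleep is shortened; in the real program you’d fetch and parse the page inside the function and sleep for ten seconds:

```python
import time

def get_dice_roll():
    # Stub standing in for the scraping step; in the real program
    # you'd request the page here and read the result element's text.
    return "4"

results = []
for i in range(4):
    results.append(get_dice_roll())
    print(f"The result of your dice roll is: {results[-1]}")
    if i < 3:            # sleep only between requests, not after the last
        time.sleep(0.1)  # use 10 in the real program
```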
With techniques like this, you can scrape data from websites that periodically update their data. However, you should be aware that requesting a page multiple times in rapid succession can be seen as suspicious, or even malicious, use of a website. It’s even possible to crash a server with an excessive number of requests, so you can imagine that many websites are concerned about the volume of requests to their server! Always check the Terms of Use and be respectful when sending multiple requests to a website.

Conclusion

Although it’s possible to parse data from the Web using tools in Python’s standard library, there are many tools on PyPI that can help simplify the process. In this tutorial, you learned how to:

- Request a web page using Python’s built-in urllib module
- Parse HTML using regular expressions and Beautiful Soup
- Interact with forms and real-time pages using MechanicalSoup
Writing automated web scraping programs is fun, and the Internet has no shortage of content that can lead to all sorts of exciting projects. Just remember, not everyone wants you pulling data from their web servers. Always check a website’s Terms of Use before you start scraping, and be respectful about how you time your web requests so that you don’t flood a server with traffic.

Additional Resources

For more information on web scraping with Python, check out the following resources: