How to do web crawling in Python (2024)

Hi! We're Apify, a full-stack web scraping platform, and the masterminds behind Crawlee, a complete open-source web crawling and browser automation library. Check us out.

What is web crawling?

Web crawling is a process in which automated programs, commonly known as crawlers or spiders, systematically browse websites to find and index their content. Search engines such as Google, Yahoo, and Bing rely heavily on web crawling to understand the web and provide relevant search results to users.

So, crawlers visit web pages, extract information, and follow links to discover new pages. This collected data can be utilized for various purposes, including:

  1. Search engine indexing: Crawlers power search engines by understanding website content and structure, enabling them to rank and display relevant results for user queries.
  2. Data extraction: Web crawlers can extract specific website data for analysis or research. Businesses can leverage this to track competitor pricing and adjust their own accordingly.
  3. Website monitoring: Crawlers can monitor websites for updates or changes in content.

How does web crawling differ from web scraping? Web scraping is a technique for extracting data from a webpage by making a request to the target website's URL.

Unlike web crawling, which involves discovering and collecting URLs across many pages, web scraping focuses only on a known URL and extracts the data available on that page.
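To make the distinction concrete, here's a minimal scraping sketch that pulls data out of a single known page with BeautifulSoup. The HTML fragment is invented for illustration:

```python
from bs4 import BeautifulSoup

# A stand-in for HTML fetched from a single, known URL.
html = """
<html><body>
  <h1 class="product-title">Sample Book</h1>
  <p class="price">£19.99</p>
  <a href="/related/book-2">Related</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Scraping: pull data from this one page only...
title = soup.find("h1", class_="product-title").text
price = soup.find("p", class_="price").text

# ...whereas a crawler would also collect the links for further visits.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title, price, links)
```

A crawler wraps exactly this kind of extraction in a loop that keeps following the collected links.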

Why use Python for web crawling?

Python is a highly popular programming language for web crawling tasks due to its simplicity and rich ecosystem. It offers a vast range of libraries and frameworks specifically designed for web crawling and data extraction, including popular ones like Requests, BeautifulSoup, Scrapy, and Selenium.

Once you’ve extracted your data, you can use Python's data science ecosystem to analyze it further. Libraries like Pandas provide powerful tools for data cleaning, manipulation, and analysis. Additionally, libraries like Matplotlib and Seaborn can help you create stunning data visualizations to help you better understand the insights hidden within your extracted data.
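As a small illustration of that downstream step, here's how pandas might clean price strings like the ones scraped later in this tutorial. The sample rows are invented:

```python
import pandas as pd

# Invented sample rows shaped like the book data scraped later in this tutorial.
df = pd.DataFrame(
    {
        "title": ["A Light in the Attic", "Tipping the Velvet"],
        "price": ["£51.77", "£53.74"],
    }
)

# Strip the currency symbol and convert to a numeric column for analysis.
df["price"] = df["price"].str.lstrip("£").astype(float)

print(df["price"].mean())  # → 52.755
```

From here, the usual pandas/Matplotlib toolchain (grouping, sorting, plotting) applies directly.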

This makes Python a great choice for web scrapers.

Understanding Python web crawlers

A web crawler starts with a list of URLs to visit, called seeds. These seeds serve as the entry point for any web crawler. For each URL, the crawler makes an HTTP request and downloads the HTML content of the page. The raw HTML is then parsed to extract useful information, such as links to other pages. The new links are added to a queue for future exploration, and the remaining data is stored for processing in a separate pipeline.
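The process described above can be sketched as a simple queue-driven loop. Here, fetch_links is a hypothetical stand-in for the download-and-parse step, backed by a tiny fake site so the sketch runs without any network access:

```python
from collections import deque

def fetch_links(url):
    # Hypothetical stand-in for "download the page and parse out its links".
    fake_web = {
        "https://example.com/": ["https://example.com/a", "https://example.com/b"],
        "https://example.com/a": ["https://example.com/b"],
        "https://example.com/b": [],
    }
    return fake_web.get(url, [])

seeds = ["https://example.com/"]
queue = deque(seeds)   # URLs waiting to be crawled
visited = set()        # URLs already crawled

while queue:
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)
    for link in fetch_links(url):
        if link not in visited:
            queue.append(link)

print(sorted(visited))
```

The rest of the tutorial fleshes out this skeleton with real HTTP requests and HTML parsing.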


The crawler will then make GET requests to these new links to repeat the same process as it did with the seed URL. This recursive process enables the script to visit every URL on the domain and gather all the available information.

Building a Python web crawler

Building a basic web crawler in Python requires two libraries: one to download the HTML content from a URL and another to parse it and extract links.

In my experience, the combination of Requests and BeautifulSoup is an excellent choice for this task. Requests, an HTTP library, simplifies sending HTTP requests and fetching web pages. BeautifulSoup then parses the content retrieved by Requests.

Apart from this, Python provides Scrapy, a complete web crawling framework for building scalable and efficient crawlers; we'll look into it in later sections.

Let's build a Python web crawler using Requests and BeautifulSoup. For this tutorial, we'll use Books to Scrape as the target website.


Prerequisites

Before you start, make sure you meet all the following requirements:

  1. Download the latest version of Python from the official website. For this tutorial, we’re using Python 3.12.2.
  2. Choose a code editor like Visual Studio Code or PyCharm, or use an interactive environment such as Jupyter Notebook.

Let’s start by creating a virtual environment using the venv module. (The activation command below is for Windows; on macOS/Linux, run source myenv/bin/activate instead.)

python -m venv myenv
myenv\Scripts\activate

Install the following libraries:

pip install requests beautifulsoup4 lxml

Crawling script

Create a new Python file named main.py and import the project dependencies:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

We're using urljoin from the urllib.parse module to join the base URL and a crawled URL into an absolute URL for further crawling. As it's part of the Python standard library, there's no need to install it. We’ll look at the base URL and crawled URL later in this section.

Use the requests library to download the first page.

class MyWebCrawler:
    def __init__(self):
        # Initialize necessary variables
        pass

    def navigate(self):
        html_content = requests.get("https://books.toscrape.com/").text

The variable html_content contains the HTML data retrieved from the server. You can parse it using BeautifulSoup, with the lxml option specifying the parser library to use. All the parsed data is stored in a soup variable.

class MyWebCrawler:
    # ...

    def navigate(self, url):
        # ...
        soup = BeautifulSoup(html_content, "lxml")

Now, find all anchor tags with an href attribute in the parsed data and iterate through them to extract the other relevant URLs on the page.

class MyWebCrawler:
    # ...

    def navigate(self, url):
        # ...
        for anchor_tag in soup.find_all("a", href=True):
            link = urljoin(url, anchor_tag["href"])

The urljoin function combines the base URL with the relative URL found in the href attribute, resulting in the full, absolute URL for the linked webpage.

For example, urljoin("https://books.toscrape.com/", "catalogue/category/books1/index.html") produces the absolute URL https://books.toscrape.com/catalogue/category/books1/index.html.
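Since urljoin is pure string manipulation, you can verify this behavior directly:

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/"

# A relative href is resolved against the base URL.
print(urljoin(base, "catalogue/category/books1/index.html"))
# → https://books.toscrape.com/catalogue/category/books1/index.html

# An href that is already absolute is returned unchanged.
print(urljoin(base, "https://example.com/other"))
# → https://example.com/other
```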

The extracted link should not be present in the visited_urls list, which stores all the links previously visited by the crawler. It should also be absent from the urls_to_visit queue, which contains URLs scheduled for future crawling. Once valid links are extracted from a page, they are added to the urls_to_visit queue.

class MyWebCrawler:
    # ...

    def navigate(self, url):
        # ...
        if link not in self.visited_urls and link not in self.urls_to_visit:
            self.urls_to_visit.append(link)

The while loop continues processing URLs in the urls_to_visit queue. Here's what happens for each URL:

  1. Dequeue the URL from the urls_to_visit list.
  2. Call navigate() to fetch the page and enqueue new links to urls_to_visit.
  3. Mark the dequeued URL as visited by adding it to visited_urls.
class MyWebCrawler:
    # ...

    def start(self):
        while self.urls_to_visit:
            current_url = self.urls_to_visit.pop(0)
            self.navigate(current_url)
            self.visited_urls.append(current_url)

Here’s the complete code.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


class MyWebCrawler:
    def __init__(self, initial_urls=None):
        # Avoid a mutable default argument by creating a fresh list per instance.
        self.visited_urls = []
        self.urls_to_visit = list(initial_urls or [])

    def navigate(self, url):
        try:
            html_content = requests.get(url).text
            soup = BeautifulSoup(html_content, "lxml")
            for anchor_tag in soup.find_all("a", href=True):
                link = urljoin(url, anchor_tag["href"])
                if link not in self.visited_urls and link not in self.urls_to_visit:
                    self.urls_to_visit.append(link)
        except Exception as e:
            print(f"Failed to navigate: {url}. Error: {e}")

    def start(self):
        while self.urls_to_visit:
            current_url = self.urls_to_visit.pop(0)
            print(f"Crawling: {current_url}")
            self.navigate(current_url)
            self.visited_urls.append(current_url)


if __name__ == "__main__":
    MyWebCrawler(initial_urls=["https://books.toscrape.com/"]).start()

When you run the script, the crawler prints each URL as it visits it.

Our crawler visits all the URLs on the page step-by-step.

It will first visit the "Home" page, then follow the "Books" link, and then visit all the categories like Travel, Mystery, and so on.

Finally, it will visit every book.

Extract data

You can extend this logic to perform web scraping as well. This will allow you to extract product data and save it while crawling the web. By locating elements by tag and class, you can extract the book title, price, and availability and save this information in a dictionary.

self.book_details = []
book_containers = soup.find_all("article", class_="product_pod")
for container in book_containers:
    title = container.find("h3").find("a")["title"]
    price = container.find("p", class_="price_color").text
    availability = container.find("p", class_="instock availability").text.strip()
    self.book_details.append(
        {"title": title, "price": price, "availability": availability}
    )

The data will be extracted and stored in a list of dictionaries. You can then export this scraped data to a CSV file. To do this, import the csv module and use the following logic for the export.

def save_to_csv(self, filename="books.csv"):
    with open(filename, "w", newline="") as csvfile:
        fieldnames = ["title", "price", "availability"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for book in self.book_details:
            writer.writerow(book)

Here's the complete Python script for the web crawler.

To demonstrate the functionality, I have set a limit that extracts only 25 books. However, you can remove the limit if you want to extract all the books on the website.

Please note that the process may take some time.

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


class MyWebCrawler:
    def __init__(self, initial_urls=None):
        # Avoid a mutable default argument by creating a fresh list per instance.
        self.visited_links = []
        self.book_details = []
        self.links_to_visit = list(initial_urls or [])

    def navigate(self, url):
        try:
            html_content = requests.get(url).content
            soup = BeautifulSoup(html_content, "lxml")
            book_containers = soup.find_all("article", class_="product_pod")
            for container in book_containers:
                if len(self.book_details) >= 25:
                    return
                title = container.find("h3").find("a")["title"]
                price = container.find("p", class_="price_color").text
                availability = container.find(
                    "p", class_="instock availability"
                ).text.strip()
                self.book_details.append(
                    {"title": title, "price": price, "availability": availability}
                )
            for anchor_tag in soup.find_all("a", href=True):
                link = urljoin(url, anchor_tag["href"])
                if link not in self.visited_links and link not in self.links_to_visit:
                    self.links_to_visit.append(link)
        except Exception as e:
            print(f"Failed to navigate: {url}. Error: {e}")

    def start(self):
        while self.links_to_visit:
            if len(self.book_details) >= 25:
                break
            current_link = self.links_to_visit.pop(0)
            self.navigate(current_link)
            self.visited_links.append(current_link)

    def save_to_csv(self, filename="books.csv"):
        with open(filename, "w", newline="") as csvfile:
            fieldnames = ["title", "price", "availability"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for book in self.book_details:
                writer.writerow(book)


if __name__ == "__main__":
    crawler = MyWebCrawler(initial_urls=["https://books.toscrape.com/"])
    crawler.start()
    crawler.save_to_csv()

Run the script. Once it finishes, you'll find a new file named books.csv in your project folder.


Congratulations, you've just learned how to build a basic web crawler!

Yet, there are certain limitations and potential disadvantages to this code:

  • The crawler revisits the same pages multiple times, causing unnecessary network requests and processing overhead.
  • It lacks proper error handling and doesn't address specific HTTP errors, connection timeouts, or other potential issues. Additionally, there's no retry mechanism in place.
  • The URL queue is simply a list, making it inefficient for handling a large number of URLs.
  • By ignoring the robots.txt file, the crawler can overwhelm the target server and lead to IP blocking. The robots.txt file often specifies a crawl-delay instruction that should be respected to avoid overloading the server.
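For the robots.txt point in particular, the standard library's urllib.robotparser can read these rules for you. Here's a sketch using an invented robots.txt; a real crawler would load the file from the site with set_url() and read():

```python
from urllib import robotparser

# An invented robots.txt, supplied as lines for demonstration purposes.
rules = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/catalogue/page-1.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/admin"))          # False
print(rp.crawl_delay("*"))                                             # 5
```

Checking can_fetch() before each request, and sleeping for the crawl delay between requests, goes a long way toward polite crawling.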

You can solve the above issues with custom code, implementing parallelism, retry mechanisms, and error handling separately.
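As one sketch of such custom fixes, the snippet below pairs a deque-plus-set frontier (O(1) membership checks instead of scanning a list) with a requests session that retries transient errors. The retry counts and status codes are illustrative choices, not requirements:

```python
from collections import deque

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class Frontier:
    """URL queue with O(1) dedupe, instead of scanning a plain list."""

    def __init__(self, seeds=()):
        self.queue = deque()
        self.seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def make_session():
    """Session that retries transient errors with exponential backoff."""
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session


frontier = Frontier(["https://books.toscrape.com/"])
frontier.add("https://books.toscrape.com/")  # duplicate, silently ignored
session = make_session()  # use session.get(url, timeout=10) when fetching
```

Even with these pieces in place, you still have to wire up parallelism and robots.txt handling yourself.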

However, this approach can become overly complex.

Thankfully, Python offers Scrapy, an all-in-one web crawling framework.

In the next section, we'll create a web crawler using Scrapy to address these limitations. We'll explore how Scrapy provides a set of functionalities and simplifies building custom crawlers.

Advanced techniques in Python web crawling

Beyond basic libraries like Requests and BeautifulSoup, Python offers powerful tools for tackling complex web crawling challenges.

Playwright and Selenium are browser automation tools that let you control a (headless) browser to interact with web pages much as a real user would. These tools are ideal for scraping complex, JavaScript-heavy websites where you need to mimic user actions like clicking buttons or filling out forms.

However, when it comes to efficiently crawling large websites and extracting structured data, Scrapy is the go-to framework in Python.

Scrapy is the most popular web scraping and crawling Python framework, with nearly 51k stars on GitHub. It utilizes asynchronous scheduling and handling of requests, allowing you to define custom spiders that navigate websites, extract data, and store it in various formats.

Scrapy provides a robust suite of tools: the Scheduler manages the URL queue, the Downloader retrieves web content, Spiders parse the downloaded content and extract data, and Item Pipelines clean and store the extracted data. This makes it well-suited for a variety of web crawling tasks.
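For instance, an item pipeline is just a class with a process_item method. The pipeline below is a hypothetical example that normalizes a price field; in a real project you would enable it via the ITEM_PIPELINES setting in settings.py:

```python
# A hypothetical item pipeline: Scrapy calls process_item for every yielded item.
# Enable it in settings.py, e.g.:
#   ITEM_PIPELINES = {"satyamcrawler.pipelines.CleanPricePipeline": 300}
class CleanPricePipeline:
    def process_item(self, item, spider):
        # Turn a price string like "£51.77" into a float for easier analysis.
        item["price"] = float(item["price"].lstrip("£"))
        return item


# The class is plain Python, so you can exercise it without running a crawl:
cleaned = CleanPricePipeline().process_item({"price": "£51.77"}, spider=None)
print(cleaned)
```

Because pipelines are ordinary classes, swapping in validation, deduplication, or database storage is just a matter of writing another process_item.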

Installing Scrapy

To start web crawling using Python, install the Scrapy framework on your system. Open your terminal and run the following command:

pip install scrapy

Creating a Scrapy project

Once Scrapy is installed, use the following command to create a new project structure:

scrapy startproject satyamcrawler

Choose any name for your project, for example, “satyamcrawler”. Once you execute the command successfully, a directory structure containing various Python files will be created. This is what the typical directory structure looks like:

satyamcrawler/
├── scrapy.cfg
└── satyamcrawler/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

The core directory for Scrapy is the spiders directory. This is where Scrapy looks for Python files containing the code that defines how to crawl websites.

To start your crawling process, create a new Python file (bookcrawler.py in our case) inside the spiders directory and write the following code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookCrawler(CrawlSpider):
    name = "satyamspider"
    start_urls = [
        "https://books.toscrape.com/",
    ]
    allowed_domains = ["books.toscrape.com"]
    rules = (Rule(LinkExtractor(allow="/catalogue/category/books/")),)

The code defines a BookCrawler class, subclassed from the built-in CrawlSpider.

CrawlSpider is designed to efficiently navigate and extract data from websites with a hierarchical structure, where pages are interconnected through links.

The Rule class defines how to process the webpages encountered during crawling. Each rule is built around a LinkExtractor instance, which specifies the patterns used to extract links from a webpage.

In our code, only one rule is defined for now.

The allow parameter of the LinkExtractor is set to /catalogue/category/books/, which means the spider should only follow links that contain this string in their URL.

The start_urls list contains the URLs from where the spider will begin crawling. In this case, we have only one URL: the homepage of the website.

The allowed_domains attribute specifies the domains that the spider is allowed to crawl. Here, it's set to "books.toscrape.com", so the spider will only follow links that belong to this domain.

Let's run the crawler and see what happens. Open your terminal and navigate to the satyamcrawler/ directory before running the following command:

scrapy crawl satyamspider

When you run the above command, Scrapy initializes the BookCrawler spider class, creates requests for each URL in start_urls, and sends them to the scheduler.

The scheduler checks whether the request's domain is allowed by the spider's allowed_domains attribute (if specified). If so, the request goes to the downloader, which fetches the response from the server.

Scrapy then uses the rules to extract matching links from the response. It follows these links recursively, creating new requests to continue crawling the website.

This process continues until there are no more links to follow or Scrapy's limits are reached (depth limits, maximum pages, etc.).

At this point, you’ll see every URL your spider has crawled in your output window.


As shown in the above result, only the URLs matching our defined rules are extracted.

However, if you need to extract specific data, like the book category, title, and price, you need to define a parse_item function within the crawler class (e.g., BookCrawler).

This function receives the response from each request made by the crawler and returns the necessary data extracted from the response.

You can use various methods, such as CSS selectors and XPaths, to extract data.

In the next section, we’ll focus on using CSS selectors to extract data from web page crawl responses.

Extracting data using Scrapy

Let's see how to extract data using Scrapy.

Our rules define that URLs must contain /catalogue/category/books/.

This particular string is present in the URLs of the catalogs listed on the left side of the books.toscrape.com homepage.

As a result, Scrapy will visit each catalog and extract the relevant data.


As we discussed, you need CSS selectors to extract data.

To extract the title, you can use h3 a::text.

For the price, use p.price_color::text.

For book availability, use p.availability::text.

Here, the ::text pseudo-element is used to select the text content.


Here’s the complete code. (Note that the parse_item function only runs after you set the callback parameter on the Rule.)

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookCrawler(CrawlSpider):
    name = "satyamspider"
    start_urls = [
        "https://books.toscrape.com/",
    ]
    allowed_domains = ["books.toscrape.com"]
    rules = (
        Rule(LinkExtractor(allow="/catalogue/category/books/"), callback="parse_item"),
    )

    def parse_item(self, response):
        category = response.css("h1::text").get()
        books = []
        for book in response.css("article.product_pod"):
            title = book.css("h3 a::text").get()
            price = book.css("p.price_color::text").get()
            availability = book.css("p.availability::text")[1].get().strip()
            books.append({"title": title, "price": price, "availability": availability})
        yield {"category": category, "books": books}

In short, here's what the code does:

  1. It extracts the category of books from the page using the CSS selector h1::text.
  2. Then, it iterates over each book on the page using article.product_pod.
  3. For each book, it extracts the title, price, and availability using CSS selectors.
  4. It appends this information to a list of books.
  5. Finally, it yields a dictionary containing the category and the list of books.

You’ve created a spider that crawls a website and retrieves data. Run the spider and see the magic.

scrapy crawl satyamspider

The output in the console window shows that the data is extracted successfully and returned in the form of a dictionary.


Now, one last thing to discuss:

What if you want to exclude some categories?

For this, you can use the deny parameter of LinkExtractor and pass a regular expression matching the category slugs you want to exclude (e.g., Travel, Mystery, Historical Fiction, and Sequential Art).

# rules must be an iterable of Rule objects, hence the trailing comma.
rules = (
    Rule(
        LinkExtractor(
            allow="/catalogue/category/books/",
            deny=r"(travel_2|mystery_3|historical-fiction_4|sequential-art_5)",
        ),
        callback="parse_item",
    ),
)

Now, when you run your code, all categories except these four will be crawled.

Saving data to JSON

To save the crawled data as a JSON file, run the following command in your terminal:

scrapy crawl satyamspider -o data.json

When you execute the command, the web pages crawled by the spider and their corresponding data are displayed in the console.

By using the -o flag, Scrapy will store all of the retrieved data in a JSON file called data.json.

Once the crawl is complete, a new file named data.json will be created in the project directory.

This file will contain all of the book-related data retrieved by the crawler.


Best practices for web crawling

If you're using Scrapy to crawl large websites like Amazon or eBay with millions of pages, you need to crawl responsibly by adjusting some settings.

Scrapy settings allow you to customize the behavior of all its components, including the core, extensions, pipelines, and spiders.

Some of the settings are:

  • USER_AGENT: Allows you to specify the user agent. The default user agent is "Scrapy/VERSION (+https://scrapy.org)".
  • DOWNLOAD_DELAY: Use it to throttle your crawling speed and avoid overwhelming servers. This setting specifies the minimum number of seconds to wait between two consecutive requests to the same domain.
  • DOWNLOAD_TIMEOUT: The amount of time (in seconds) that the downloader will wait before timing out. The default is 180.
  • CONCURRENT_REQUESTS_PER_DOMAIN: The maximum number of concurrent (i.e., simultaneous) requests that will be performed to any single domain. The default is 8.
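In settings.py, these options are plain module-level assignments. The values below are illustrative starting points rather than recommendations; tune them per target site:

```python
# satyamcrawler/settings.py (illustrative values)

BOT_NAME = "satyamcrawler"

# Identify your crawler honestly instead of using the default Scrapy UA.
USER_AGENT = "satyamcrawler (+https://example.com/contact)"

# Wait at least 2 seconds between requests to the same domain.
DOWNLOAD_DELAY = 2

# Give up on a response after 60 seconds instead of the default 180.
DOWNLOAD_TIMEOUT = 60

# Keep at most 4 requests in flight per domain (the default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Respect robots.txt rules.
ROBOTSTXT_OBEY = True
```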

Scrapy crawls are optimized for a single domain by default.

If you intend to crawl across multiple domains, you'll need to adjust these settings for broad crawls.

You can limit the total number of pages crawled using the CLOSESPIDER_PAGECOUNT setting of the closespider extension. If the spider exceeds this limit, it will be stopped with the reason closespider_pagecount.

Similarly, when crawling websites with millions of pages, configure Scrapy settings appropriately to ensure optimal performance.

Wrapping up

I've guided you through building a web crawler and then scraping data using Requests and Scrapy.

The Scrapy framework, designed specifically for web crawling, simplifies the process and allows you to efficiently crawl websites.

If you want to learn more about web crawling, start with basic websites like the one in this tutorial.

Try applying multiple techniques to crawl data from them.

As you progress, tackle more advanced websites like eBay or Amazon. Crawling these will present new challenges, forcing you to learn and apply more.

Once you've built the crawler and it's working as expected, consider deploying your Scrapy code to the cloud.

Apify simplifies transforming your Scrapy projects into Apify Actors with just a few commands. Its reliable cloud infrastructure lets you run, monitor, schedule, and scale your spiders efficiently.

Now, your spider is ready for thousands of users worldwide. With a good response, you can monetize your Actor and earn anywhere from a few bucks to hundreds of dollars per month.

Article information

Author: Cheryll Lueilwitz