A practical guide to building a web scraper

In our previous blog post, we defined web scraping as the art and science of acquiring publicly available data from a targeted website. I call it an art because we often come across challenges that require out-of-the-box thinking. In this guide, we will cover the process of building a web scraper that:

  • Automatically scrapes data daily
  • Avoids most of the anti-scraper methodologies
  • Cross-validates to detect missing data
  • Generates precise and detailed logs for analysis
  • Sends real-time email notification alerts
  • Cleans the extracted data and stores it in the desired file format
  • Transfers the file to a remote system
  • Allows on-the-fly performance measurement

Before we dive deep, some background is necessary. Our application needed to scrape product lists and their prices from four different websites, and the team ran into several problems along the way. This post will help you identify and tackle these challenges. I have also curated a technology stack that can aid you in building a powerful web scraper.

We recommend using Python as the coding language owing to its large selection of libraries. Based on the nature of the websites that need to be scraped, the supporting frameworks would change. 

Here’s a list of scenarios and possible solutions:

  1. Use Scrapy for any CAPTCHA challenges
  2. A combination of Selenium and Scrapy is ideal for websites that expose few details in their URLs and change their data on action events such as clicking drop-downs
  3. Bare Selenium and Python scripts are of great help if scraping is blocked by the website’s robots.txt

These are but a few of the scenarios and not all challenges may be visible right away. 

Now, let’s take tiny steps towards our end goal.

Avoid getting banned

The most important aspect of building a web scraper is to avoid getting banned! Websites have defensive systems against bots, i.e. they integrate with anti-scraping technologies. If you make multiple requests from a single IP in a short time, your application will be blocked and possibly even blacklisted. It could be a temporary block or a permanent one. Using proxies in such scenarios would be a better choice. 

The Python library scrapy-rotated-proxy automatically uses proxies in rotation. Depending on your use case, you can opt for a free or a paid proxy plan. In our case, the website allowed requests only from the US region, so we integrated a paid proxy service restricted to US proxies into our application.
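
To make the idea of rotation concrete, here is a minimal hand-rolled sketch using only the standard library. It is not the scrapy-rotated-proxy API; the proxy URLs and the helper name `proxy_rotation` are hypothetical. Each value it yields is a scheme-to-proxy mapping, the format that HTTP clients such as requests accept for their `proxies` argument.

```python
from itertools import cycle

# Hypothetical US-region proxy endpoints; replace with the ones your provider gives you.
PROXIES = [
    "http://us-proxy-1.example.com:8000",
    "http://us-proxy-2.example.com:8000",
    "http://us-proxy-3.example.com:8000",
]


def proxy_rotation(proxies):
    """Yield proxy settings in round-robin order, one mapping per request."""
    for proxy in cycle(proxies):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXIES)
# Take one mapping per outgoing request; after the last proxy the cycle restarts.
settings_for_four_requests = [next(rotation) for _ in range(4)]
```

Round-robin is the simplest policy. A real setup would probably also drop proxies that start returning errors.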

However, bombarding the website with too many requests in a short span can still get you banned, and subscribing to a service with millions of proxies can be cost-prohibitive. Hence, use appropriate delays wherever necessary. Even with Selenium, adding delays is a successful scraping strategy. Above all else, respect the website! Scraping should never interfere with the website’s normal operations or the purpose it serves.
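
In a Scrapy project, delays can be configured in settings.py. The fragment below is a sketch: the setting names are Scrapy's own, but the numbers are illustrative assumptions that you should tune for the site you are scraping.

```python
# settings.py (fragment) for a Scrapy project; all values are illustrative assumptions.

DOWNLOAD_DELAY = 2                  # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait between 0.5x and 1.5x DOWNLOAD_DELAY instead of a fixed interval
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit one domain with parallel requests

AUTOTHROTTLE_ENABLED = True         # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 2        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30         # upper bound on the delay when the server is slow
```

With Selenium, a randomized pause such as `time.sleep(random.uniform(2, 5))` between page actions serves the same purpose.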

Repetitive scraping

How do we execute the scraper script every day? A cron job that runs daily can do the work. It can be set up with the crontab command in a shell or with the python-crontab library. However, I do not find these approaches flexible. A crontab is error-prone if your scraper spans multiple directories and files, since cron jobs need absolute file paths.

Instead, I use Celery, which handles periodic tasks more smoothly. Celery works with message brokers such as RabbitMQ or Redis. It can periodically execute the scraper and also hit the proxy API to fetch fresh proxies at timed intervals. Apache Airflow is another option for scheduling workflows.

Output formats

We now have a scraper that executes every day and is anonymous to the target site. However, we are missing a crucial element – the output. The output could be stored in a database, a text file, CSV file or any other file format.

Here are a few use-cases:

  • In the case of CRUD operations, using an ORM instead of raw SQL queries is more flexible. SQLAlchemy was our choice!
  • Pandas does a great job of cleaning, manipulating and analyzing data and writing it to a CSV file
  • Outputting chunks of text to a text file needs no additional libraries
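
As an illustration of the Pandas use case, the sketch below cleans a small batch of scraped rows and renders them as CSV. The column names and sample values are made up, and `clean_prices` is a hypothetical helper.

```python
import pandas as pd

# Made-up scraped rows: stray whitespace, currency symbols, a duplicate and a missing price.
raw = pd.DataFrame(
    {
        "product": ["  Widget A ", "Widget B", "Widget B", "Widget C"],
        "price": ["$1,299.00", "$15.50", "$15.50", None],
    }
)


def clean_prices(df):
    """Trim names, convert price strings to numbers, drop unusable and duplicate rows."""
    cleaned = df.copy()
    cleaned["product"] = cleaned["product"].str.strip()
    # Keep only digits and the decimal point, then convert; anything unparseable becomes NaN.
    digits_only = cleaned["price"].str.replace(r"[^0-9.]", "", regex=True)
    cleaned["price"] = pd.to_numeric(digits_only, errors="coerce")
    cleaned = cleaned.dropna(subset=["price"]).drop_duplicates()
    return cleaned.reset_index(drop=True)


cleaned = clean_prices(raw)
csv_text = cleaned.to_csv(index=False)  # pass a file path instead to write to disk
```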

I recommend an ORM since it abstracts the underlying database, whether that is SQLite, PostgreSQL, MySQL or another engine. Once SQLAlchemy is configured, the code remains the same across different databases.
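
The sketch below shows what that looks like in practice, assuming SQLAlchemy 1.4 or later. The `ProductPrice` model, its columns and the sample rows are hypothetical. Only the URL passed to `create_engine` names the database, so switching from SQLite to PostgreSQL or MySQL means changing that one string.

```python
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class ProductPrice(Base):
    """Hypothetical table holding one scraped price per row."""

    __tablename__ = "product_prices"

    id = Column(Integer, primary_key=True)
    product = Column(String, nullable=False)
    price = Column(Float, nullable=False)


# In-memory SQLite for the example; a PostgreSQL or MySQL URL would go here instead.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all(
        [
            ProductPrice(product="Widget A", price=1299.0),
            ProductPrice(product="Widget B", price=15.5),
        ]
    )
    session.commit()
    cheapest_first = session.query(ProductPrice).order_by(ProductPrice.price).all()
    saved_names = [row.product for row in cheapest_first]
```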

Cross-validation of data

How you cross-validate data varies across use cases. We had tabular data: the input file consisted of product names with other supporting parameters, and our job was to scrape prices and store them in an output database or as a CSV. Every time the scraper executed, a few records failed. To find out which records failed and retry them, I followed a simple approach:

  1. Read the input file and the output file using Pandas and store them as separate DataFrames
  2. Check whether the index of each input record is present in the output DataFrame
  3. Store the indexes of the failed records in a Python set
  4. Create a new DataFrame from the input DataFrame containing only the indexes in the set
  5. Rerun the scraper script for only those failed records, then repeat the check
  6. If the set is empty, quit the scraper and send the output file to a remote system using the pysftp library
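
The steps above can be sketched as follows. This is an illustration rather than our production code: `scrape` stands for any callable that takes a DataFrame of records and returns the rows it managed to scrape under the same index, `flaky_scrape` is a stand-in that fails once on purpose, and the `max_rounds` cap is my addition so that a permanently failing record cannot loop forever. In a real run the input would come from `pd.read_csv`, and the SFTP transfer is left out.

```python
import pandas as pd


def failed_records(input_df, output_df):
    """Return the input rows whose index is missing from the output."""
    failed = set(input_df.index) - set(output_df.index)
    return input_df.loc[sorted(failed)]


def scrape_until_complete(input_df, scrape, max_rounds=5):
    """Scrape everything once, then retry only the failed records."""
    output_df = scrape(input_df)
    for _ in range(max_rounds):
        retry_df = failed_records(input_df, output_df)
        if retry_df.empty:
            break  # nothing is missing, so the output is complete
        output_df = pd.concat([output_df, scrape(retry_df)])
    return output_df.sort_index()


# Stand-in scraper: loses the record with index 2 on its first call only.
calls = []


def flaky_scrape(df):
    calls.append(list(df.index))
    scraped = df.drop(index=2) if len(calls) == 1 else df
    return scraped.assign(price=9.99)


products = pd.DataFrame(
    {"product": ["Widget A", "Widget B", "Widget C"]}, index=[1, 2, 3]
)
prices = scrape_until_complete(products, flaky_scrape)
```

Here the second call to the scraper receives only the record with index 2, which is the point of the approach: a retry costs one request instead of a full rerun.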

This solution may not apply to all scenarios. There could be intricate problem statements where validating the data is impossible. As I said, cross-validation varies across use cases.

Working with source website updates

We all love updates in our lives, and websites love them too! Building a scraper is not a one-time job. The XPaths we use in our crawler stop matching when a website updates itself. Hence, we need to update these XPaths within the crawler too.

However, how do we track if our scraper has failed to fetch data?

That is where logs and notification alerts come into the picture. Selenium raises descriptive exceptions when an element cannot be found. Scrapy, on the other hand, returns an empty list or None when an XPath matches nothing, which is what happened in our case. In both cases, we write the errors and warnings to a log file. When an XPath fails, sending an email notification with the log attached aids rapid debugging. Sending emails is easy with Python’s smtplib module.
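
Here is a minimal sketch of that flow using only the standard library. The helper names, the subject line and the SMTP host and port are assumptions. `matches` is whatever list your XPath query returned, so an empty list is treated as a failure and logged.

```python
import logging
import smtplib
from email.message import EmailMessage
from pathlib import Path


def first_or_none(matches, field, url, logger):
    """Return the first XPath match, logging a warning when nothing matched."""
    if not matches:
        logger.warning("XPath for %r returned no data on %s", field, url)
        return None
    return matches[0]


def build_alert(log_path, sender, recipient):
    """Build an email that carries the scraper log as an attachment."""
    log_path = Path(log_path)
    message = EmailMessage()
    message["Subject"] = "Scraper alert: XPath failure"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("The scraper logged XPath failures. The log file is attached.")
    message.add_attachment(log_path.read_text(), filename=log_path.name)
    return message


def send_alert(message, host="localhost", port=25):
    """Send the alert through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)
```

Keeping the message building separate from the sending makes the alert easy to test without a mail server.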

Conclusion

To put everything together:

  • Use a shell script which executes the scheduler
  • The scheduler triggers the cross-validator, and the cross-validation script runs our crawler/spider
  • The spider file will scrape data, log errors and call the email notification service when required
  • The cross-validator executes the crawler repeatedly until all the records are successfully scraped
  • To see how long the scraper operated, start and finish times are recorded in the cross-validator
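
Measuring the run time needs very little code. In the sketch below, `timed_run` is a hypothetical wrapper and `job` is any zero-argument callable, for example a function that starts the cross-validator.

```python
import time
from datetime import timedelta


def timed_run(job):
    """Run job() and return its result together with the elapsed wall-clock time."""
    started = time.perf_counter()
    result = job()
    elapsed = timedelta(seconds=time.perf_counter() - started)
    return result, elapsed


# Example with a trivial job standing in for the scraper run.
record_count, elapsed = timed_run(lambda: len(["Widget A", "Widget B", "Widget C"]))
summary = f"Scraped {record_count} records in {elapsed.total_seconds():.3f} seconds"
```

Writing the summary line to the log gives you a performance record for every daily run.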

This is just one of the successful recipes with the necessary ingredients to build a web scraper. There are more intricate details and other solutions to tackle a particular problem. While including everything would fill up a book, if you think there is a better way to solve the above challenges, I’d love to hear about it via the comments section below or on our Twitter feed.

About the Writer

  • Yash Ghorpade

    Yash is a software developer at Synerzip. Over the course of two years he has gained know-how on data and expertise in exploratory data analysis, data visualization, manipulation, natural language processing and machine learning. In his spare time, Yash builds web apps using Python. He holds a Bachelor’s degree in Computer Engineering from Savitribai Phule Pune University.
