In our previous blog, we defined web scraping as the art and science of acquiring publicly available data from a targeted website. I call it an art because we often come across challenges that require out-of-the-box thinking. In this guide, we will cover the process of building a web scraper that:
- Automatically scrapes data daily
- Avoids most of the anti-scraper methodologies
- Cross-validates to detect missing data
- Generates precise and detailed logs for analysis
- Sends real-time email notification alerts
- Cleans the extracted data and stores it in the desired file format
- Transfers the file to a remote system
- Allows on-the-fly performance measurement
Before we dive deep, some background on the ground realities is necessary. Our application needed to scrape product lists and their prices from four different websites, and along the way the team came across several problems. This blog will help you identify and tackle these challenges. I have also curated a technology stack that can aid you in building a powerful web scraper.
We recommend using Python as the coding language owing to its large selection of libraries. Based on the nature of the websites that need to be scraped, the supporting frameworks would change.
Here’s a list of scenarios and possible solutions:
- Use Scrapy for any CAPTCHA challenges
- A combination of Selenium and Scrapy is ideal for websites that expose few details in their URLs and change data on action events such as clicking drop-downs
- Bare Selenium and Python scripts are of great help if scraping is blocked by the website’s robots.txt
These are but a few of the scenarios and not all challenges may be visible right away.
Now, let’s take tiny steps towards our end goal.
Avoid getting banned
The most important aspect of building a web scraper is to avoid getting banned! Websites have defensive systems against bots; they integrate anti-scraping technologies. If you make multiple requests from a single IP in a short time, your application will be blocked, possibly even blacklisted, either temporarily or permanently. Using proxies in such scenarios is the better choice.
The Python library scrapy-rotated-proxy automatically rotates through a pool of proxies. Based on your use case, you can opt for a free or paid proxy plan. In our case, the target website allowed requests only from the US region, so we integrated a paid proxy service restricted to the US.
However, bombarding the website with too many requests in a short span can still get you banned, and accessing a service with millions of proxies can be cost-prohibitive. Hence, use appropriate delays wherever necessary; even with Selenium, well-placed delays are a successful scraping strategy. Above all else, respect the website! Scraping should never interfere with the website’s normal operations or the purpose it serves.
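The proxy-rotation and delay ideas above can be sketched with the standard library alone. This is a minimal illustration, not the scrapy-rotated-proxy internals; the proxy URLs and the delay bounds are assumptions for the example.

```python
import itertools
import random

# Hypothetical proxy pool; in practice this would come from your
# paid proxy provider's API, restricted to the required region.
PROXIES = [
    "http://us-proxy-1.example.com:8000",
    "http://us-proxy-2.example.com:8000",
    "http://us-proxy-3.example.com:8000",
]

proxy_cycle = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Return the next proxy in round-robin order."""
    return next(proxy_cycle)

def polite_delay(base: float = 2.0, jitter: float = 3.0) -> float:
    """A randomized delay (in seconds) between requests, so the
    traffic pattern does not look machine-generated."""
    return base + random.uniform(0, jitter)

# With the requests library, each fetch would then look roughly like:
#   requests.get(url, proxies={"http": next_proxy(), "https": next_proxy()})
#   time.sleep(polite_delay())
```

Randomizing the delay matters as much as the delay itself: a fixed interval between requests is an easy bot signature to detect.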
How do we execute the scraper script every day? A cron job set to run daily can do the work. It could be set up with bash commands or with Python’s python-crontab library. However, these approaches are not very flexible: the crontab is more error-prone if your scraper spans multiple directories and files, since it needs absolute file paths.
Instead, I use Celery, which handles periodic tasks more smoothly and works with most brokers, such as RabbitMQ or Redis. Celery can periodically execute the scraper as well as hit the proxy API to fetch fresh proxies at timed intervals. Apache Airflow can also schedule such workflows.
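A Celery setup along those lines might look like the following configuration sketch. The broker URL, task names, and schedule intervals are all illustrative assumptions, not details from our project.

```python
# A configuration sketch, assuming Redis as the broker; the task names,
# intervals, and the 02:00 run time are illustrative choices.
from celery import Celery
from celery.schedules import crontab

app = Celery("scraper", broker="redis://localhost:6379/0")

@app.task
def run_scraper():
    ...  # invoke the spider / cross-validator here

@app.task
def refresh_proxies():
    ...  # hit the proxy provider's API for a fresh pool

app.conf.beat_schedule = {
    # Run the scraper every day at 02:00.
    "daily-scrape": {
        "task": "tasks.run_scraper",
        "schedule": crontab(hour=2, minute=0),
    },
    # Fetch fresh proxies every 30 minutes.
    "refresh-proxies": {
        "task": "tasks.refresh_proxies",
        "schedule": 30 * 60.0,
    },
}
```

A celery worker plus celery beat pair then replaces the cron job, and because tasks are plain Python functions, there is no absolute-path bookkeeping.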
We now have a scraper that executes every day and is anonymous to the target site. However, we are missing a crucial element – the output. The output could be stored in a database, a text file, CSV file or any other file format.
Here are a few use-cases:
- In the case of CRUD operations, using the ORM model over raw SQL queries could be flexible. SQLAlchemy was our choice!
- Pandas does a great job at cleaning, manipulating, and analyzing data and outputting it to a CSV file
- Outputting chunks of text to a text file needs no additional libraries
I recommend an ORM since it abstracts away the underlying database, be it SQLite, PostgreSQL, MySQL, or many others: once configured, the code remains the same across different databases.
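To make the ORM point concrete, here is a minimal SQLAlchemy sketch. The `Product` model and its columns are hypothetical, not our actual schema; only the connection URL would change when swapping databases.

```python
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Product(Base):
    """Hypothetical table for scraped prices; columns are illustrative."""
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    price = Column(Float)

# Swapping SQLite for PostgreSQL or MySQL only changes this URL.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(Product(name="widget", price=9.99))
    session.commit()
    fetched = session.query(Product).filter_by(name="widget").one()
```

All CRUD operations go through the session, so the scraper code never contains dialect-specific SQL.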
Cross-validation of data
Cross-validation of data varies across use-cases. We had tabular data: the input file consists of product names with other supporting parameters, and our job was to scrape prices and store them in an output database or as a CSV. Every time the scraper executed, a few records failed. To see which records failed and why, I followed a simple approach:
- Read the input file and output file using Pandas and store them as separate DataFrames
- Check whether each input record’s index is present in the output DataFrame
- Store the indexes of failed records in a Python set
- Create a new DataFrame from the input DataFrame containing only the indexes in our set
- Retry the scraper script only for those failed records, in the same fashion
- If no indexes are found in our set, quit the scraper and send the output file to a remote system using the pysftp library
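The steps above can be sketched with Pandas as follows. In-memory DataFrames stand in for the real files (which would come from `pd.read_csv`), and the column names and index values are hypothetical.

```python
import pandas as pd

# In practice these come from pd.read_csv(...); in-memory frames are
# used here for illustration, with hypothetical columns and indexes.
input_df = pd.DataFrame(
    {"product": ["a", "b", "c", "d"]},
    index=[101, 102, 103, 104],
)
output_df = pd.DataFrame(
    {"product": ["a", "c"], "price": [9.99, 4.50]},
    index=[101, 103],
)

# Indexes present in the input but missing from the output:
# these are the records the scraper failed on.
failed = set(input_df.index) - set(output_df.index)

# Build a retry frame containing only the failed records.
retry_df = input_df.loc[sorted(failed)]

# The scraper is then re-run against retry_df only; once `failed`
# is empty, the output file is shipped via pysftp and we quit.
```

The set difference keeps the check O(n) regardless of how the two files are ordered.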
This solution may not apply to all scenarios. There could be intricate problem statements where validating the data would be impossible. However, as I said, cross-validation varies across varied use-cases.
Working with source website updates
We all love updates in our lives, and websites love them too! Building a scraper is not a one-time thing. The XPaths we use in our crawler break when a website updates itself, so we need to update them within the crawler too.
However, how do we track if our scraper has failed to fetch data?
That is where logs and notification alerts come into the picture. Selenium raises clear exceptions, while Scrapy returns an empty list or None when an XPath fails, which is what happened in our case. In both cases, we write a log file listing the errors and warnings. On an XPath failure, sending an email notification with the log attached can aid rapid debugging, and sending email in Python is easy with the smtplib module.
To put everything together,
- Use a shell script which executes the Scheduler
- The Scheduler triggers the cross-validator. The cross-validation script administers our crawler/spider.
- The spider file will scrape data, log errors and call the email notification service when required
- The cross-validator re-runs the crawler until all the records are successfully scraped
- To see how long the scraper operated, start and finish times are calculated in the cross-validator
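The timing bookkeeping in that last step can be as simple as the sketch below; `scrape_batch` is a hypothetical stand-in for one crawler run.

```python
import time

def scrape_batch() -> None:
    """Placeholder for one crawler run over the pending records."""
    time.sleep(0.01)  # stands in for actual scraping work

# Wrap the run with a monotonic clock to measure how long it operated.
start = time.perf_counter()
scrape_batch()
elapsed = time.perf_counter() - start
print(f"Scraper ran for {elapsed:.2f} seconds")
```

`time.perf_counter` is preferable to `time.time` here because it is monotonic and unaffected by system clock adjustments mid-run.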
This is just one of the successful recipes with the necessary ingredients to build a web scraper. There are more intricate details and other solutions to tackle a particular problem. While including everything would fill up a book, if you think there is a better way to solve the above challenges, I’d love to hear about it via the comments section below or on our Twitter feed.