Posted On 20 Sep 2022
How to Scrape Dynamic Web Pages
Dynamic web pages are web pages that can be personalized for each user and that let the user interact with them. In the early years of the internet, by contrast, a web page was a static page that showed the same information to everyone.
Scraping has transformed many business processes, enabling more data-driven decision-making. It has also become easier over time thanks to low-code and no-code web scraping solutions.
A common barrier you may face while building your web scraping solution is the complexity of dynamic web pages. In this article, we will explain how to scrape dynamic web pages, why they are challenging to scrape, and which solutions you can use to collect information from those websites.
Today, most websites you can think of fall into the dynamic category. Any website where you search for a product or service, click to reveal more sections, or see different images and ads based on who you are is an example of dynamic website content.
Dynamic websites use server-side and client-side scripting to display unique content each time a visitor requests a page.
Scraping dynamic web pages is difficult because the content can change based on what the user wants to see. The web page’s initial source code often does not yet contain the information you would like to scrape; it only appears once the browser has run the page’s scripts. You must therefore render the page before you can extract data from it.
There are different approaches to scraping a dynamic webpage:
- Beautiful Soup
- Selenium Python Library
- Google Sheets
We’ll explain these approaches one by one in this article. You can contact us if you have difficulty understanding or want further information.
Beautiful Soup is a popular Python module that parses a downloaded web page into a specific format and provides a convenient interface to navigate content.
The official documentation of Beautiful Soup can be found on ScrapeWithBots. The latest version of the module can be installed using this command: pip install beautifulsoup4.
Beautiful Soup is an excellent tool for extracting data from web pages, but it only works with the page’s source code. Dynamic sites must first be rendered the way a browser would display them, and that’s where Selenium comes in.
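To make that limitation concrete, here is a minimal sketch. The HTML string is a made-up stand-in for a dynamic page, not taken from any real site: the "quotes" container is empty in the source and would only be filled once a browser runs the script. Beautiful Soup parses the markup but never executes JavaScript, so it finds nothing inside the container.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the container stays empty until a browser runs the script.
PAGE_SOURCE = """
<html>
  <body>
    <div id="quotes"></div>
    <script>
      document.getElementById("quotes").innerHTML =
        "<p class='quote'>Rendered by JavaScript</p>";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(PAGE_SOURCE, "html.parser")
container = soup.find(id="quotes")

# Beautiful Soup does not execute scripts, so the container has no quote elements.
quotes_in_source = container.find_all(class_="quote")
print(len(quotes_in_source))
```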
Scraping a page with Beautiful Soup usually involves two steps:
- Download the page as a whole. This step is like opening the page in your web browser when scraping manually.
- Extract the data you need from the HTML of the website and convert it to a machine-readable format like JSON or XML.
Beautiful Soup is a Python library used in web scraping to pull data out of HTML and XML files. It creates a parse tree from the page source code that you can use to extract data in a hierarchical and more readable manner.
The package is imported under the name bs4, so you need an import statement such as from bs4 import BeautifulSoup at the beginning of your Python code.
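The extraction step can be sketched as follows. This is only an illustration under assumptions: the HTML string stands in for a page you have already downloaded, and the markup and class names ("quote", "text", "author") are hypothetical.

```python
import json

from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the markup and class names are hypothetical.
html = """
<div class="quote"><span class="text">First quote</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Second quote</span><small class="author">Author B</small></div>
"""

# Parse the HTML and pull out the fields we care about.
soup = BeautifulSoup(html, "html.parser")
records = [
    {
        "text": quote.find(class_="text").get_text(strip=True),
        "author": quote.find(class_="author").get_text(strip=True),
    }
    for quote in soup.find_all("div", class_="quote")
]

# Convert the result to a machine-readable format (JSON here).
as_json = json.dumps(records, indent=2)
print(as_json)
```

In a real scraper the html variable would hold the text of the page fetched in the first step.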
Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers using a special connector – a web driver.
Selenium is one of the first big automation clients created for automating website testing. It supports two browser control protocols: the older WebDriver protocol and, as of Selenium 4, the Chrome DevTools Protocol (CDP).
It’s a well-established package implemented in multiple languages, with a vast community, an extensive feature set, and a robust underlying structure. It supports all major web browsers:
- Chrome, Firefox, Safari, Edge, Internet Explorer (and their derivatives)
Its large community has been around for a while, which means loads of free resources, and its synchronous API is easy to understand for everyday automation tasks.
Selenium is a free (open source) automated testing framework to validate web applications across different browsers and platforms. You can use multiple programming languages, like Java, C#, Python, etc., to create Selenium Test Scripts. Here, we use Python as our primary language.
A Step-by-Step Guide to Scraping a Dynamic Web Page with Python Selenium
The first step is to visit our target website and inspect the HTML.
The second step: now that you have the HTML open in front of you, you’ll need to find the correct elements. Since the website we chose contains quotes from famous authors, we are going to scrape the elements with the following classes:
- ‘quote’ (the whole quote entry), then for each of the quotes, we’ll select these classes:
The third step: we already have Selenium set up, but we also need to import the By class from the Selenium Python library to simplify selection. By defines the locator strategies (class name, ID, XPath, and so on) that Selenium uses to find elements on the page.
The fourth step: you can leave the driver setup as it is in the example. Just make sure to choose the browser you’ll be scraping with. As for the target (https://scrapewithbots.com/), you should type it in exactly.
Let’s also set up a variable for how many pages we want to scrape. Let’s say we want 3. The scraper output will be a JSON-style list, quotes_list, containing information about the quotes we scrape.
The fifth step is choosing a selector. Selenium offers various selectors, but for this guide we use the class name, so we will locate our elements with By.CLASS_NAME from the Selenium Python library.
Now that we have our selector ready, we first have to open the target website in Selenium. A call to the driver’s get method with the target URL does just that. Then it’s time to get the information that we need.
In the sixth step, it’s finally time to start doing some scraping. Don’t worry. Yes, there will be quite a bit of code to tackle, but we’ll give examples on the link and explain each step of the process here https://scrapewithbots.com/.
The seventh step: we’re nearing the finish line. All that’s left now is to add the last line of code, which prints the results of your scraping request to the console window after the pages finish scraping.
- pprint(quotes_list)
Can you scrape a dynamic web page in Google Sheets? Yes, you can. Google Sheets can be regarded as a basic web scraper. You can use a built-in formula to extract data from websites, import it directly into Google Sheets, and share it with your friends.
- Open a new Google sheet.
- Open a target website with Chrome. In this case, we choose Games sales. Please right-click on the web page, which brings out a drop-down menu. Then select “inspect.” Press a combination of three keys: “Ctrl” + “Shift” + “C” to activate “Selector”. It would allow the inspection panel to get the information of the selected element within the webpage.
- Copy and paste the website URL into the sheet.
- Copy the XPath of the element. Select the price element and right-click to bring up the drop-down menu. Then select “Copy,” and choose “Copy XPath.”
- Type the formula into the spreadsheet. =IMPORTXML("URL", "XPATH expression")
Note that the “XPath expression” is the one we just copied from Chrome. Replace any double quotation marks (") within the XPath expression with single quotation marks (') so that they don’t end the formula argument early.
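If you build these formulas often, the quote replacement can be automated. The small Python helper below is only a sketch: the function name build_importxml_formula, the URL, and the XPath are all made up for illustration.

```python
def build_importxml_formula(url, xpath):
    """Build a Google Sheets IMPORTXML formula from a URL and a copied XPath."""
    # Chrome's "Copy XPath" uses double quotes, e.g. //*[@id="price"]; inside the
    # formula they would end the argument early, so swap them for single quotes.
    safe_xpath = xpath.replace('"', "'")
    return f'=IMPORTXML("{url}", "{safe_xpath}")'


# Hypothetical URL and XPath, for illustration only.
formula = build_importxml_formula("https://example.com/games", '//*[@id="price"]')
print(formula)
```

The printed string can be pasted straight into a cell.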
Dynamic web pages, like static websites, contain a lot of data to help you better understand your industry. Many dynamic websites contain more data than most static ones since the former typically has more images and social media integrations.
Here’s a list of some of the essential data you can get from dynamic web pages:
- What your competitors are doing in your industry
- Reviews of your competitors’ products
Using scraping techniques like Beautiful Soup, Python Selenium, and Google Sheets, you can get data from a dynamic website.
In a nutshell, go with Beautiful Soup if you want to speed up development or if you want to familiarize yourself with Python and web scraping. With Scrapy, you can implement demanding web scraping applications in Python, provided you have the appropriate know-how. Use Selenium if your primary goal is to scrape dynamic content with Python, or Google Sheets if you want a quick option that needs no code.
Thanks for reading. If you wish to learn more about web scraping, don’t hesitate to get in touch with us at ScrapeWithBots.