- Web Scrape Selenium Tutorial
- Web Scraping Using Selenium
- Web Scrape Selenium Download
- Web Scraping Selenium Ide
Jan 22, 2019 RSelenium automates a web browser and lets us scrape content that is dynamically altered by JavaScript for example. In this RSelenium tutorial, we will be going over two examples of how it can be used. For example #1, we want to get some latitude and longitude coordinates for some street addresses we have in our data set.
This page explains how to do web scraping with Selenium IDE commands. Web scraping works if the data is inside the HTML of a website. If you want to extract data from a PDF, image or video you need to use visual screen scraping instead.
- Oct 13, 2019 Selenium is a framework for web testing that allows simulating various browsers and was initially made for testing front-end components and websites. As you can probably guess, whatever one would like to test, another would like to scrape. And in the case of Selenium, this is a perfect library for scraping.
- What is web scraping? To align with terms, web scraping, also known as web harvesting, or web data extraction is data scraping used for data extraction from websites. The web scraping script may access the url directly using HTTP requests or through simulating a web browser. The second approach is exactly how selenium works – it simulates a.
When to use what command?
The table below shows the best command for each type of data extraction. Click the recommended command for more information and example code.
Data to extract is in... | Command to use | Comment |
---|---|---|
Visible website text, for example text in a table just like this one, or a price on a website | storeText | |
Text in input fields (input box, text area, select drop down,...) | storeValue | Do not confuse this command with storeEval, which is not for web scraping. |
Get the status of a checkbox or radiobutton | storeChecked | |
URL 'behind' an image | storeAttribute@href | storeAttribute with the target xpath=...@href extracts the link of any element - if it has one! If that fails, consider browser automation to copy the link to the ${!clipboard} variable. |
ALT text 'behind' an image | storeAttribute@alt | The storeAttribute command can be used to get any attribute the HTML element has. For example, use @alt to get the 'Alt' text of an image. |
Page title | storeTitle | |
Table content: Row/Column/Cell | storeText with XPath locator | See TABLE Web Scraping or automate browser addon |
Data from a list e. g. search results | Loop over storeText | See How to web scrape search results |
Save complete web page source code | XType with ${KEY_CTRL+KEY_S}* | On Mac it is ${KEY_CMD+KEY_S}. |
Save complete web page with images | XType with ${KEY_CTRL+KEY_S}* | See Forum post: How to save the entire HTML code |
Take screenshot of website | captureEntirePageScreenshot* | This saves the complete website as image. |
Take screenshot of a web page element | storeImage* | This is an easy way to extract images. The other option is to download them. |
Text found only in the website source code | sourceExtract* | e. g. Google Analytics ID. For text inside page comments or Javascript, this is the only option |
PDF, Image, Video, Canvas | OCRExtractRelative* | This screen scraping command works everywhere because it works visually. The disadvantage is that it is slower than the pure HTML-based commands like storeText. |
Text from outside the web page | OCRExtractRelative* | For example, if you want to extract data from a browser extension or a desktop app |
(*) These commands are only available in the UI.Vision RPA Selenium IDE. They are not part of the classic Selenium IDE.
See also
- Screen scraping (scraping/data extraction with computer vision, OCR)
- Form filling with Selenium IDE (the opposite of web scraping)
- File uploads with Selenium IDE
- Best Selenium IDE Locator Strategy
- RPA Software User Manual
In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
- Clicking on buttons
- Filling forms
- Scrolling
- Taking a screenshot
It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application, and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have the following installed on your local machine:
- Chrome
- Chromedriver
- The selenium package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then run pip install selenium.
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
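Here is a minimal sketch of that first launch, assuming a Selenium 4 release. The helper name start_chrome and its driver_path parameter are made up for this example; when no path is given, Selenium is left to locate Chromedriver on its own.

```python
def start_chrome(url, driver_path=None):
    """Open a regular (headful) Chrome window and load `url`."""
    # Imported here so the helper can be defined before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path:
        # Point Selenium at a Chromedriver binary you downloaded yourself.
        driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    else:
        # Let Selenium locate Chromedriver on its own.
        driver = webdriver.Chrome()
    driver.get(url)
    return driver
```

Remember to call driver.quit() on the returned object once you are done, otherwise the browser process keeps running.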
This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code). You should see a message stating that the browser is controlled by automated software.
Chrome can also run in headless mode (without any graphical user interface), which is what you need to run it on a server.
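A headless launch could look like the sketch below. The function names and the 1920x1200 window size are assumptions chosen for illustration; the two command-line switches are standard Chrome flags.

```python
def headless_arguments(width=1920, height=1200):
    """Command-line switches for a headless Chrome with a fixed window size."""
    return ["--headless", f"--window-size={width},{height}"]


def start_headless_chrome(url, width=1920, height=1200):
    """Launch Chrome without a visible window and load `url`."""
    # Imported here so the helpers can be defined before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_arguments(width, height):
        options.add_argument(argument)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    return driver
```

Setting the window size explicitly matters in headless mode, because there is no real window to inherit a size from.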
driver.page_source returns the full HTML code of the page.
Here are two other interesting WebDriver properties:
- driver.title gets the page's title
- driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
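As a small sketch, the three properties above can be collected in one helper. The name describe_page and the dictionary keys are our own choice; driver is any Selenium WebDriver instance.

```python
def describe_page(driver, url):
    """Load `url` and return the basic facts about the resulting page."""
    driver.get(url)
    return {
        "title": driver.title,             # text of the <title> tag
        "final_url": driver.current_url,   # URL after any redirections
        "html": driver.page_source,        # full HTML of the page
    }
```

Comparing final_url with the URL you asked for is a quick way to detect that the site redirected you.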
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
- Tag name
- Class name
- IDs
- XPath
- CSS selectors
We recently published an article explaining XPath. Don't hesitate to take a look if you aren't familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (Cmd + Shift + C on macOS) instead of having to right click + inspect each time.
find_element
There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag on a page.
Every find_element lookup also has a find_elements counterpart (note the plural) that returns a list of elements.
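To make this concrete, assume a hypothetical page whose h1 carries the class someclass and the ID greatID (both values are made up). The sketch below finds that same element with five different strategies. It passes the strategy names as plain strings, which are the values behind Selenium's By constants, so it needs no import of its own.

```python
# Hypothetical markup for illustration; the class and ID values are made up.
SAMPLE_HTML = """
<html>
  <body>
    <h1 class="someclass" id="greatID">Super title</h1>
  </body>
</html>
"""

# (strategy, value) pairs that all point at the same h1. The strategy strings
# are the values of By.TAG_NAME, By.CLASS_NAME, By.ID, By.XPATH and
# By.CSS_SELECTOR from selenium.webdriver.common.by.
H1_LOCATORS = [
    ("tag name", "h1"),
    ("class name", "someclass"),
    ("id", "greatID"),
    ("xpath", "//h1"),
    ("css selector", "h1#greatID"),
]


def locate_h1(driver):
    """Return the h1 element once per locator strategy."""
    return [driver.find_element(by, value) for by, value in H1_LOCATORS]
```

In real code you would pick one strategy, usually the ID when the element has one, since IDs are supposed to be unique on a page.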
For example, to get all anchors on a page, use the following:
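A possible version of that lookup, as a sketch: link_urls is a name chosen for this example, and "tag name" is the string value of Selenium's By.TAG_NAME constant.

```python
def link_urls(driver):
    """Return the href of every anchor on the current page."""
    anchors = driver.find_elements("tag name", "a")  # same as By.TAG_NAME
    urls = []
    for anchor in anchors:
        href = anchor.get_attribute("href")
        if href:  # skip anchors that have no href attribute
            urls.append(href)
    return urls
```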
Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements. Here are the most useful:
- Accessing the text of the element with the property element.text
- Clicking on the element with element.click()
- Accessing an attribute with element.get_attribute('class')
- Sending text to an input with element.send_keys('mypassword')
There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.
It is useful for avoiding honeypots (like filling hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, a form might contain an HTML input with the attribute type=hidden.
This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That's a classic honeypot.
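One way to stay clear of such traps is to check is_displayed() before typing anything. The helper below is a sketch with a made-up name; it expects a list of WebElement objects, for example the result of a find_elements call.

```python
def fill_visible_inputs(inputs, value):
    """Type `value` into every input the user can see; skip the rest."""
    filled = 0
    for element in inputs:
        if not element.is_displayed():
            continue  # hidden input: a likely honeypot, leave it untouched
        element.send_keys(value)
        filled += 1
    return filled
```

The return value tells you how many inputs were actually filled, which is handy when you want to log skipped fields.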
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
- Go to the login page using driver.get()
- Select the username input using driver.find_element() and then element.send_keys() to send text to the input
- Follow the same process with the password input
- Click on the login button using element.click()
Should be easy right? Let's see the code:
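The steps above can be sketched as follows. The login URL and the three XPath expressions are assumptions about how the Hacker News login form is marked up, so check them in your dev tools before relying on them. "xpath" is the string value of Selenium's By.XPATH constant.

```python
# Assumed address of the login page.
LOGIN_URL = "https://news.ycombinator.com/login"


def log_in(driver, username, password):
    """Fill in the login form and submit it."""
    driver.get(LOGIN_URL)
    # The XPath expressions below are assumptions about the form's markup.
    driver.find_element("xpath", "//input[@type='text']").send_keys(username)
    driver.find_element("xpath", "//input[@type='password']").send_keys(password)
    driver.find_element("xpath", "//input[@value='login']").click()
```

find_element returns the first match in document order, so these expressions target the first text and password inputs on the page.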
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
- Check for an error message (like “Wrong password”)
- Check for one element on the page that is only displayed once logged in.
So, we're going to check for the logout button. The logout button has the ID “logout” (easy)!
We can't just check if the element is None because find_element raises an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:
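A sketch of that check, with a function name of our own choosing. It looks for the element with the ID “logout” and treats NoSuchElementException as "not logged in".

```python
def is_logged_in(driver):
    """Return True if the logout element is present in the DOM."""
    # Imported here so the helper can be defined before Selenium is installed.
    from selenium.common.exceptions import NoSuchElementException

    try:
        driver.find_element("id", "logout")  # "id" is the value of By.ID
    except NoSuchElementException:
        return False
    return True
```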
Taking a screenshot
We could easily take a screenshot using:
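For example, with a small helper like this one. The file name and window size are arbitrary defaults picked for the sketch; set_window_size and save_screenshot are regular WebDriver methods.

```python
def take_screenshot(driver, path="screenshot.png", width=1920, height=1200):
    """Resize the window, then save a PNG screenshot to `path`."""
    driver.set_window_size(width, height)
    # save_screenshot returns True on success and False on an I/O error.
    return driver.save_screenshot(path)
```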
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it's simple and we don't have to worry about these issues.
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
- Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
- Use a WebDriverWait object.
If you use a time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting for too long or not enough. Also, the website can load slowly on your local Wi-Fi connection but be 10 times faster on your cloud server. With a WebDriverWait object, you wait only as long as it takes for your element/data to be loaded, up to a timeout you choose.
For instance, you can wait up to five seconds for an element located by the ID “mySuperId” to be loaded. There are many other interesting expected conditions like:
- element_to_be_clickable
- text_to_be_present_in_element
You can find more information about this in the Selenium documentation
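The five-second wait on the ID “mySuperId” could be written like this. It is a sketch: wait_for_element is a name we made up, while WebDriverWait and presence_of_element_located come from Selenium's support package.

```python
def wait_for_element(driver, element_id, timeout=5):
    """Block until the element with `element_id` is in the DOM, then return it."""
    # Imported here so the helper can be defined before Selenium is installed.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = ("id", element_id)  # "id" is the value of By.ID
    # until() polls the condition and raises TimeoutException after `timeout` seconds.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(locator)
    )
```

A call such as wait_for_element(driver, "mySuperId") returns as soon as the element shows up, so a fast page costs you almost no waiting time.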
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a bit to see it. You can easily do this with Selenium:
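Here is a sketch built on driver.execute_script. The helper names and the default of 1000 pixels are our own choices.

```python
def scroll_down(driver, pixels=1000):
    """Scroll the page down by `pixels` using JavaScript."""
    # Extra arguments to execute_script are exposed to the script as arguments[0], ...
    driver.execute_script("window.scrollBy(0, arguments[0]);", pixels)


def page_height(driver):
    """Return the document height as reported by the browser."""
    # A script that uses `return` hands its value back to Python.
    return driver.execute_script("return document.body.scrollHeight;")
```

Calling scroll_down(driver) before taking a screenshot brings content further down the page into view.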
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don't hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate it with Selenium. Just don't forget this xkcd: