Python requests-html – Learn Web scraping

Last year, I started freelancing as a web scraper using the requests and beautifulsoup modules. After a few projects, I ran into a strange issue while scraping a website.

The website used JavaScript to build its content, and I was unable to solve the JavaScript rendering problem with the Python requests module.

So I started looking for a Python library that could help me with JavaScript rendering, and it turns out that the requests-html library handles it.

In this article, I will explain the easiest ways to scrape the web using the Python requests-html library.

Web Scraping

Web scraping means extracting the required information from a webpage. For me, it was a good source of income when I started freelancing with Python.

If you know the Python basics, learning web scraping will be fun for you. You can do many interesting things by scraping websites in Python.

Python is a good choice if you are thinking about web scraping. It offers several libraries for scraping websites, and requests-html is a good example of one.

What is requests-html?

requests-html is a Python library for scraping websites. It can help you scrape many types of websites, including dynamic ones. requests-html supports JavaScript rendering, and this is what sets it apart from many other Python libraries used for web scraping.

In my experience, requests-html is one of the most convenient libraries for web scraping. Once you have learned it, scraping websites becomes much easier, as you will see by the end of this requests-html tutorial.

JavaScript rendering

When a developer uses JavaScript to create or change Document Object Model (DOM) elements, it is called JavaScript rendering. In simple words, JavaScript rendering means using JavaScript to produce the output shown in the browser.

Example of JavaScript rendering

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Document</title>
</head>
<body >
    <!-- The H1 element will be created in the body using JavaScript -->
   
   <script type="text/javascript">
        var h1_tag = document.createElement("h1");
        h1_tag.innerHTML = "H1 Generated with JavaScript";
        var body_tag = document.getElementsByTagName('body')[0];
        body_tag.appendChild(h1_tag);
    </script>
</body>
</html> 
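
To see why this matters for scraping, here is a minimal sketch that parses a page like the one above with Python's built-in html.parser, which does not execute JavaScript. The PAGE string and the TagCounter class are illustrative names and are not part of requests-html. The static markup contains a script tag but no h1 tag, because the heading only exists after a browser has run the script.

```python
from html.parser import HTMLParser

# The example page, trimmed to the parts that matter for this sketch.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head><title>Document</title></head>
<body>
  <script type="text/javascript">
    var h1_tag = document.createElement("h1");
    h1_tag.innerHTML = "H1 Generated with JavaScript";
    document.body.appendChild(h1_tag);
  </script>
</body>
</html>"""


class TagCounter(HTMLParser):
    """Count start tags by name, the way a non-rendering scraper sees them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(PAGE)

# The script is present in the raw HTML, but the h1 it would create is not.
h1_count = counter.counts.get("h1", 0)
script_count = counter.counts.get("script", 0)
print("h1 tags:", h1_count)
print("script tags:", script_count)
```

A scraper that only downloads the HTML therefore never sees the heading. A library that renders the page first can.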

Why should you use requests-html?

requests-html solves the JavaScript rendering problem, and this is the main reason to use it in Python. requests, beautifulsoup, and Scrapy are also used for web scraping, but I find requests-html the easiest of them for scraping a website.

Features of Python requests-html library

  1. Async support
  2. JavaScript support
  3. Cookie persistence
  4. Parsing abilities
  5. Support for multiple selectors

You can also use the requests-html library to parse HTML files without making a request. JavaScript rendering is supported for local files as well.

How to use the requests-html library?

When you are scraping websites with the Python requests-html library, follow these steps to extract the data.

Step 1: Find the target element on the web page.

Step 2: Inspect the target element that you want to extract.

Step 3: Choose the proper selector (id, class name, or XPath).

Step 4: Get the target element using the requests-html library.


Install requests-html library in Python

Before doing anything else, we need to install the requests-html library. requests-html is not a built-in module, but it is easy to install. The right approach depends on your setup.

Install requests-html using pip

pip is probably the easiest way to install a Python package, and you can use it to install the requests-html library.

Run the following command in the terminal to install the latest version of requests-html:

python -m pip install requests-html

To install a specific version of requests-html, run:

python -m pip install requests-html==0.10.0

If you want to upgrade an already installed requests-html library, run the following command in the terminal:

python -m pip install --upgrade requests-html

Install requests-html using conda

To install the latest version of requests-html using conda, run the following command.

conda install requests-html

Install requests-html in jupyter

Jupyter is a convenient environment for web scraping projects. In a Jupyter notebook, you can install requests-html by running the following in a cell:

pip install requests-html

Install requests-html in Linux

If you are using a Linux operating system, first install pip, and then use pip to install the requests-html library.

pip install requests-html

Inspecting elements on a Web Page

To scrape a target element from a web page, the first step is to find that specific element on the page. This process is known as inspecting elements. It is a three-step process.

Follow these steps to inspect an element on a web page.

  1.  Go to the specific webpage using the URL.
  2.  Right-click on the target element that you want to extract.
  3.  Click on Inspect, and it will open the inspection window.

Example of inspecting elements on a webpage:

Let’s say we want to scrape this webpage [https://www.hepper.com/most-beautiful-dog-breeds/].

Step 1: Copy and paste the URL into your browser’s address bar.

Step 2: Right-click on the target element.

Let’s say you want to grab the first section. Just right-click on it.

Step 3: Click on Inspect, the last option in the context menu.

After clicking on Inspect, you will see the inspection window open inside the tab. You can now get the HTML code of the element.

This is how we inspect the target elements. For practice, you can use the [https://webscraper.io/] website for testing purposes.

Using different types of selectors in requests-html

The requests-html library supports both CSS selectors and XPath. We can select an element using its tag name, id, class, another attribute, or an XPath expression. In this section, I will show you how to use these different selectors to grab an element.
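
As a rough illustration of what each selector type matches, the sketch below uses only the standard library's html.parser on a made-up snippet. The SAMPLE markup, the ElementCollector class, and the select() helper are hypothetical and are not how requests-html works internally. They only mirror the matching rules behind '#id', '.className', a tag name, and '[attribute="value"]'.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, not copied from any real website.
SAMPLE = """
<header role="banner" class="navbar navbar-fixed-top">
  <nav id="navbar" class="navbar-collapse collapse"></nav>
</header>
<p class="lead">Web data extraction</p>
<p>Another paragraph</p>
"""


class ElementCollector(HTMLParser):
    """Collect (tag, attributes) for every start tag in the document."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def select(elements, tag=None, id_=None, class_=None, attr=None):
    """Return the elements that satisfy every condition that was given."""
    matches = []
    for name, attrs in elements:
        if tag is not None and name != tag:
            continue
        if id_ is not None and attrs.get("id") != id_:
            continue
        # The class attribute is a space-separated list of class names.
        if class_ is not None and class_ not in (attrs.get("class") or "").split():
            continue
        if attr is not None and attrs.get(attr[0]) != attr[1]:
            continue
        matches.append((name, attrs))
    return matches


collector = ElementCollector()
collector.feed(SAMPLE)

by_id = select(collector.elements, id_="navbar")               # like '#navbar'
by_class = select(collector.elements, class_="lead")           # like '.lead'
by_tag = select(collector.elements, tag="p")                   # like 'p'
by_attr = select(collector.elements, attr=("role", "banner"))  # like '[role="banner"]'

for label, found in [("id", by_id), ("class", by_class), ("tag", by_tag), ("attr", by_attr)]:
    print(label, [name for name, _ in found])
```

An id normally matches at most one element, while a class, a tag name, or an attribute can match several, which is why these lookups return lists.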

Select element using id in requests-html

The most reliable way to select an element is to use its id. An id should be unique within a webpage, so it identifies exactly one element.

To select an element by id in requests-html, use the r.find('#id') method.

Example No 1: Select an element of a webpage using the id

For test purposes, use the https://webscraper.io webpage.

We will grab the navbar from this website using its id, ‘navbar’.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://webscraper.io/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding element with id 'navbar'
navbar = page_html.find('#navbar')
# printing element
print(navbar)

The output of the code is a list containing the navbar element.

Output: [<Element 'nav' id='navbar' role='navigation' class=('navbar-collapse', 'collapse')>]

Select element using the class name in requests-html

Just like the id, we can find an element using its class name. A class can be assigned to more than one element, which is why finding elements by class name returns a list of elements. You can use the r.find('.className') function to find elements by class name in requests-html.

Example No 2: Select an element by using the class name in requests-html

In this example, we will grab the video on the home page of the [https://webscraper.io/] website. On inspecting the video, its wrapper has the class name “intro-video-wrapper”, and the code below selects the element with the class name “embedded-video”, which carries the src attribute. So I will use that class name to find the video URL.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://webscraper.io/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding elements with class name 'embedded-video'
video_frame = page_html.find('.embedded-video')
# get all attributes of the first match
video_attrs = video_frame[0].attrs
# read the url from the 'src' attribute
video_url = video_attrs['src']
# printing the url
print(video_url)

The output of the code is the URL of the YouTube video.

Output: //www.youtube.com/embed/aViWT-WpzYI?vq=highres&enablejsapi=true


Select elements using tag name in requests-html

To find elements by tag name with requests-html, use the r.find('tagName') function. It returns a list of all elements with that tag.

This is the most general case, where you want every element of the same kind, for example all the rows of a table or all the items of a list.

Example No 3: Select a specific tag with requests-html

In this example, we want to scrape all the paragraph tags from the [https://webscraper.io/] website.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://webscraper.io/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding all the paragraphs
all_paragraphs = page_html.find('p')
# printing list of paragraphs
print(all_paragraphs)

The output of the code is a list of all paragraph elements.

Output: [<Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' class=('lead',)>, <Element 'p' class=('lead',)>, <Element 'p' class=('lead',)>, <Element 'p' class=('lead',)>, <Element 'p' class=('lead',)>, <Element 'p' >, <Element 'p' >, <Element 'p' class=('about',)>, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' >, <Element 'p' class=('copyright',)>]

Select element using CSS attribute in requests-html

Besides the id and the class name, we can use other HTML attributes to select elements from the webpage. To scrape an element by attribute, use the find('[attribute="value"]') function with a CSS attribute selector. It returns the matching elements from the webpage.

With requests and Beautiful Soup, you can achieve the same results, but you will have to take an extra step. This is the beauty of the requests-html library.

Example No 4: Select HTML elements using attribute selectors in the requests-html library

In this example, we will use the same website to grab the header. The header has a ‘role’ attribute whose value is ‘banner’, so we will use requests-html to find the header with an attribute selector on ‘role’.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://webscraper.io/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding elements whose 'role' attribute is 'banner'
header = page_html.find('[role="banner"]')
# printing the element
print(header)

The output of the code is a list of the elements with the role="banner" attribute.

Output: [<Element 'header' role='banner' class=('navbar', 'navbar-fixed-top', 'navbar-static')>]


Select element using text in requests-html

requests-html becomes even more useful with the ability to find an element by the text inside it. To find elements that contain certain text, use the r.find('selector', containing='text') function. It returns a list of all elements matching the selector that contain that text.

Example No 5: Find an element on a page based on text in requests-html

In this Python code example, we will find all the paragraphs that contain the text ‘web data extraction’.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://webscraper.io/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding elements based on text
p_tag_with_text = page_html.find('p', containing='web data extraction')
# printing the element
print(p_tag_with_text)

The output of the code is the list of paragraph tags that contain the text ‘web data extraction’.

Output: [<Element 'p' class=('lead',)>]


Select element using xpath in requests-html

When you want to get an HTML element but it has no id, worry not: requests-html also supports XPath, which makes it easy to find an element in a webpage.

XPath can be used to navigate through elements and attributes in an HTML document. If you do not know how to write an XPath expression for an element, give this Microsoft article about XPath a read.

Example No 6: Find an element with XPath in requests-html library

In this example, we use an XPath expression to get the h2 elements that are nested inside div elements.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://webscraper.io/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding all h2 elements nested inside a div using xpath
h2_inside_divs = page_html.xpath('//div//h2')
# printing the elements list
print(h2_inside_divs)

The output of this code is the list of ‘h2’ elements that are nested inside a ‘div’.

Output: [<Element 'h2' class=('featurette-heading',) id='landing-heading-point-and-click-interface'>, <Element 'h2' class=('featurette-heading',) id='landing-heading-extract-from-dynamic-web'>, <Element 'h2' class=('featurette-heading',) id='landing-heading-built-for-modern-web'>, <Element 'h2' class=('featurette-heading',) id='landing-heading-modular-selectors'>, <Element 'h2' class=('featurette-heading',) id='landing-heading-export-data-in-csv'>, <Element 'h2' >, <Element 'h2' >, <Element 'h2' >, <Element 'h2' >, <Element 'h2' >]


Get text from HTML element in requests-html

Most of the time, our goal on a webpage is to extract text from different HTML tags. So I have dedicated this section to explaining how to extract text from different HTML elements.

To get the text of any HTML element in Python, use the following steps:

  • Step 1: Install the requests-html library
  • Step 2: Create an HTML session
  • Step 3: Make a get request using requests-html
  • Step 4: Get the HTML from the response
  • Step 5: Use the find() function to find elements
  • Step 6: Get the text from each element using the text attribute of the element

Extract the text of h2 tags in requests-html

The webpage is https://www.investopedia.com/terms/s/sample.asp

The target is to extract the text from all <h2> tags.

The code to extract the text from all h2 tags is the following:

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://www.investopedia.com/terms/s/sample.asp'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding all h2 tags
h2_tags = page_html.find('h2')
# extracting text from all h2 tags
for tag in h2_tags:
    print(tag.text)

The output of the above code is the text of all h2 tags.

Scrape the text of a Paragraph with requests-html

In this example, we will use the Python library requests-html to extract the text of paragraphs.

The website to scrape data from is [https://totalhealthmagazine.com/About-Us].

Our target is to get the plain text from the paragraphs using the requests-html library in Python.

The Python code to scrape the text from all paragraphs using the requests-html library is the following.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://totalhealthmagazine.com/About-Us'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding all p tags
p_tags = page_html.find('p')
# extracting text from all p tags
for tag in p_tags:
    print(tag.text)

The output of the above Python code is the text of all paragraphs present on that page.

Find the meta tags of a website using requests-html

Meta tags hold information about a page, such as its description, keywords, and viewport settings. They are not displayed on the webpage, but search engines and browsers use them.

We can use the requests-html library to find all the meta tags of a webpage.

The following Python code uses the requests-html library to find the meta tags of the website.

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://totalhealthmagazine.com/About-Us'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# finding all meta tags
meta_tags = page_html.find('meta')
# printing the html of each meta tag
for tag in meta_tags:
    print(tag.html)

The output of the above Python code is the list of all meta tags of the website.

Output:
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<meta name="keywords" content="Total Health Magazine, health magazine, online health magazine, men's health, women's health, children's health"/>
<meta name="rights" content="TWIP 2019"/>
<meta name="robots" content="index, nofollow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"/>
<meta name="author" content="TotalHealth Editors"/>
<meta name="description" content="Total Health Magazine"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no"/>
<meta name="HandheldFriendly" content="true"/>
<meta name="apple-mobile-web-app-capable" content="YES"/>
<meta itemprop="inLanguage" content="en-GB"/>
<meta itemprop="url" content="/About-Us.html"/>
<meta itemprop="height" content="auto"/>
<meta itemprop="width" content="auto"/>

requests-html also gives us a simple way to collect the links from all the anchor tags (<a> tags) of a website.

Use the response.html.links attribute to get all the links from a webpage as they appear in the page, or use response.html.absolute_links to get them as absolute URLs.

The following Python code extracts all the links from a website (https://www.trtworld.com/).

# importing the HTMLSession class
from requests_html import HTMLSession
# create the object of the session
session = HTMLSession()
# url of the page
web_page = 'https://www.trtworld.com/'
# making get request to the webpage
response = session.get(web_page)
# getting the html of the page
page_html = response.html
# getting all links as they appear in the page
all_links = page_html.links
# printing every link
for link in all_links:
    print(link)

# getting only absolute links
absolute_links = page_html.absolute_links
for abs_link in absolute_links:
    print(abs_link)

The output of the above Python code is all the relative and absolute links available on that website.

Output:
/video/social-videos/germany-declares-early-warning-of-potential-gas-supply-disruptions/624545be42517d0017741dc8
/about
https://www.trtworld.com/sport
https://appsto.re/tr/_6Vjbb.i
https://www.trtworld.com/topics/a-place-called-pakistan
https://www.trtworld.com/opinion
/contact-us
https://www.trtworld.com/middle-east
https://twitter.com/trtworld
/cookie-policy
/video/news-videos/
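
The difference between a relative and an absolute link can be shown with the standard library alone. This is only a sketch of the idea, not the code requests-html runs internally. The raw_links list is a hand-picked sample of the kinds of links a page contains, and is_absolute() is an illustrative helper.

```python
from urllib.parse import urljoin, urlparse

base_url = "https://www.trtworld.com/"

# A hand-picked sample of links, some relative and one absolute.
raw_links = ["/about", "/contact-us", "https://twitter.com/trtworld"]


def is_absolute(link):
    """A link is absolute when it carries both a scheme and a host."""
    parts = urlparse(link)
    return bool(parts.scheme and parts.netloc)


# Relative links are resolved against the page URL; absolute ones are kept.
absolute_links = [urljoin(base_url, link) for link in raw_links]

for raw, resolved in zip(raw_links, absolute_links):
    print(f"{raw} -> {resolved}")
```

If you plan to request the links afterwards, the absolute form is the one you can pass straight to session.get().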

find title

Finding a page title is easy with requests-html. Of course, there are other ways, but a simple way to find the title of a webpage with Python is to use the find() function of the requests-html module.

Below is the Python code that finds the title of a webpage using the requests-html library.

from requests_html import HTMLSession
session = HTMLSession()
web_page = 'https://edition.cnn.com/'
response = session.get(web_page)
page_html = response.html
title = page_html.find('title')[0].text
print(title)

The output of the above code is the title of the website

Output: CNN International – Breaking News, US News, World News and Video

What is HTML Session in the requests-html library?

In requests-html, a session is a reusable object that provides cookie persistence and connection pooling, amongst other things. Every request in the examples above is made through a session.

Default HTML Session

The default HTMLSession is synchronous. Each request blocks until its response arrives, so pages are fetched one at a time.

Async HTML Session

An AsyncHTMLSession lets several requests run concurrently in a single program, so multiple web pages can be scraped at the same time.

Example No 7: Scraping 3 webpages at the same time with Async HTML session in requests-html

Three web pages are scraped at the same time. The order of the output is not guaranteed, because one web page might finish earlier than another.

from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()

async def get_cnn():
    r = await asession.get('https://edition.cnn.com/')
    title = r.html.find('title')[0].text
    print(title)

async def get_google():
    r = await asession.get('https://google.com/')
    title = r.html.find('title')[0].text
    print(title)

async def get_facebook():
    r = await asession.get('https://facebook.com/')
    title = r.html.find('title')[0].text
    print(title)

asession.run(get_google, get_facebook, get_cnn)

The output is the titles of these three webpages.

Output:
CNN International – Breaking News, US News, World News, and Video
Google
Facebook -لاگ ان کریں یا سائن اپ کریں
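
Why the titles can come back in a different order than the functions were passed can be seen in a small simulation. The sketch below uses only asyncio from the standard library. The fake_fetch() coroutine and its delays are made up, with asyncio.sleep standing in for network latency, so no real request is sent.

```python
import asyncio


async def fake_fetch(name, delay, finished):
    """Pretend to download a page; asyncio.sleep stands in for the network."""
    await asyncio.sleep(delay)
    finished.append(name)  # record the order in which pages complete
    return name


async def main():
    finished = []
    # All three coroutines are started together and run concurrently.
    results = await asyncio.gather(
        fake_fetch("google", 0.15, finished),
        fake_fetch("facebook", 0.05, finished),
        fake_fetch("cnn", 0.10, finished),
    )
    return results, finished


results, finished = asyncio.run(main())
print("started in this order: ", results)
print("finished in this order:", finished)
```

The page with the shortest delay finishes first, regardless of its position in the call. With real websites the delays depend on the network, so the order can change from run to run.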


Javascript rendering in requests-html

The requests-html library solves the JavaScript rendering problem in Python. Its JavaScript support makes it easy to scrape websites that use JavaScript to render their HTML.

With the help of requests-html, we can scrape elements that are generated by JavaScript and shown in the browser. The first time you call render(), the library downloads a copy of Chromium, so the first run takes longer.

Example No 8: In this example, we will scrape [https://www.geeksforgeeks.org/].

from requests_html import HTMLSession

session = HTMLSession()
res = session.get("https://www.geeksforgeeks.org/")
res.html.render(timeout=10000)
print(res.html.text)

The output of the code is the text that is generated after the execution of the Javascript code

How Pagination works in requests-html Library

You have probably seen sites, social networks for example, that do not show all their content at once. You see two or three posts on the current screen, and more posts are loaded as you keep scrolling or move to the next page. This is done with the help of pagination.

Websites that use pagination can be tedious to scrape, because you have to find and follow the link to each next page. requests-html offers a shortcut for this scenario: you can iterate over the HTML object to move from page to page.

Example No 9: In this example, we will scrape URLs from the dev.to website.

We could probably use Facebook, Twitter, or other social networking sites, but they require you to authenticate yourself, which would mean an extra step.

from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://dev.to/')
for html in response.html:
    print(html)

The output of this code should be the URLs of the posts available on the home page, with the loop moving on to the next page each time. However, the output is not what we expected: the pagination feature was not working when I tested it. The library is still being improved, so I have kept this section in case it starts working in the future.
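
The idea behind following pagination can be sketched without any network access. The example below is a plain-Python illustration over a made-up, in-memory site. The SITE dictionary, its page paths, and the iter_pages() helper are all hypothetical and do not reflect how requests-html finds the next page.

```python
# Hypothetical three-page site: each page lists posts and names the next page.
SITE = {
    "/page/1": {"posts": ["post-a", "post-b"], "next": "/page/2"},
    "/page/2": {"posts": ["post-c"], "next": "/page/3"},
    "/page/3": {"posts": ["post-d"], "next": None},
}


def iter_pages(site, start):
    """Yield (url, page) pairs, following 'next' links until there are none."""
    url = start
    seen = set()  # guards against a site whose pages link in a circle
    while url is not None and url not in seen:
        seen.add(url)
        page = site[url]
        yield url, page
        url = page["next"]


all_posts = []
for url, page in iter_pages(SITE, "/page/1"):
    print(url, page["posts"])
    all_posts.extend(page["posts"])
```

With a real website, the lookup in SITE would be replaced by a request for each page, and the 'next' value by whatever link the page uses to point to the following page.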

Quiz Solver Python Program

Let’s say you are given a webpage and a set of questions to answer from it. Instead of reading through the webpage, you can use requests-html to answer your quiz questions. This is a fun program you can show your friends.

Let’s say I want to answer questions from this webpage: [https://www.geeksforgeeks.org/string-data-structure/?ref=shm]. The webpage is all about strings.

In the following example, we use the search() function with a template in which {} marks the blank to fill in from that webpage.

from requests_html import HTMLSession
session = HTMLSession()
res = session.get("https://www.geeksforgeeks.org/string-data-structure/?ref=shm")
answer = res.html.search('Strings are defined as an {} of characters.')[0]
print(answer)

The output of the code is the answer to the blank space.

Output: array
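
To make the role of the {} placeholder concrete, here is a rough standard-library approximation of that kind of template search. It is a sketch built on a regular expression, not the implementation requests-html uses. The search_template() helper and the page_text string are made up for the illustration.

```python
import re


def search_template(template, text):
    """Return the text matched by the first '{}' placeholder, or None."""
    if "{}" not in template:
        raise ValueError("the template needs at least one '{}' placeholder")
    # Escape the literal parts and let each '{}' match as little as possible.
    literal_parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.+?)".join(literal_parts)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


# Made-up page text that contains the sentence we are looking for.
page_text = "Intro. Strings are defined as an array of characters. More text."

answer = search_template("Strings are defined as an {} of characters.", page_text)
print(answer)

missing = search_template("Lists are defined as a {} of items.", page_text)
print(missing)
```

The words around the placeholder act as anchors, so the more specific the template, the less likely it is to match the wrong part of the page.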


different HTTP request methods in requests-html python

You can send different types of requests using the requests-html library in Python. Different types of requests to the server return different responses. To get data from the server, we use the GET request.

HTTP delete request with the requests-html library in Python

We use the HTTP DELETE request to delete a resource from the server. To make an HTTP DELETE request with the requests-html library in Python, use the session.delete() function.

Example No 10: Making an HTTP delete request in Python with the requests-html library

In the example below, we use the requests-html library to make an HTTP DELETE request to [https://httpbin.org/delete].

from requests_html import HTMLSession
session = HTMLSession()

url='https://httpbin.org/delete'
user_to_be_deleted ={
    "user":'alixaprodev'
}
response = session.delete(url, data=user_to_be_deleted)
print(f'Status Code:{response.status_code}')
print(f'Request Type : {response.request}')
  
## output  ##
# Status Code:200 
# Request Type : <PreparedRequest [DELETE]>

HTTP get request with parameters using the requests-html library in Python

The HTTP GET method is used to request a resource from the server. A GET request should not change the state of the server, so it is normally used for retrieving data from a URL. To make a GET request with requests-html in Python, use the session.get() function. Parameters are passed with the params argument, which adds them to the query string of the URL.

Example No 11: In this example, we will be making a get request along with a parameter.

from requests_html import HTMLSession
session = HTMLSession()
# url to make a get request to
url='https://httpbin.org/get'
get_user ={
    "user":'alixaprodev'
}
# making get request; params adds the values to the query string
response = session.get(url, params=get_user)
print(f'Status Code:{response.status_code} ')
print(f'Request Type : {response.request}')

## output  ##
# Status Code:200 
# Request Type : <PreparedRequest [GET]>
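
For a GET request, parameters travel in the query string of the URL. The standard-library sketch below shows what that encoding looks like for the same dictionary, without sending anything over the network. The full_url name is only illustrative.

```python
from urllib.parse import urlencode

url = "https://httpbin.org/get"
get_user = {"user": "alixaprodev"}

# Query parameters are appended to the URL after a question mark.
query_string = urlencode(get_user)
full_url = f"{url}?{query_string}"

print(query_string)
print(full_url)
```

After a real request, comparing response.url with a URL like this is a quick way to check whether your parameters were actually sent.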

HTTP Post request using the requests-html library in Python

An HTTP POST request is used to send data to the server, usually to create or change a resource. The data is sent in the body of the request, not in the URL. To make a POST request with requests-html in Python, use the session.post() function.

Example No 12: Use the requests-html library in Python to make a POST request.

from requests_html import HTMLSession
session = HTMLSession()
# url to make a post request to
url='https://httpbin.org/post'
post_user ={
    "user":'alixaprodev',
    "pass":'password'
}
# making post request
response = session.post(url, data=post_user)
print(f'Content of Request:{response.content} ')
print(f'URL : {response.url}')

## output  ##
# Content of Request:b'{ \n  "form": {\n    "pass": "password", \n    "user": 
#"alixaprodev"\n  }
# URL : https://httpbin.org/post
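
The sketch below builds, but does not send, a POST request with the standard library's urllib, purely to show where form data ends up. It is not how requests-html constructs its requests internally, and the variable names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

post_user = {"user": "alixaprodev", "pass": "password"}

# Form data is URL-encoded and placed in the request body as bytes.
body = urlencode(post_user).encode("ascii")
request = Request("https://httpbin.org/post", data=body, method="POST")

print(request.get_method())  # the HTTP method of the prepared request
print(request.full_url)      # the URL stays free of the form data
print(request.data)          # the encoded form data travels in the body
```

Because the data is kept out of the URL, it does not show up in the address itself, which is one reason login forms use POST rather than GET.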

Frequently Asked Question

requests-html is fun when it comes to web scraping, and it has made my life easier. Below are some of the questions people have asked on different forums, along with my answers.

What is the difference between beautifulsoup and requests_html?

The Python beautifulsoup library is used for parsing HTML code and grabbing elements from an HTML document, while requests-html can also make the HTTP requests to the server. In that sense, requests_html combines features of the beautifulsoup and requests libraries.

What is the difference between the python requests module and requests_html?

The requests module is used to make different types of HTTP requests to the server, while requests_html builds on requests and adds HTML parsing and JavaScript rendering.

How do I use python to scrape a website?

To scrape a website in Python, you can use the requests-html module as shown in this article.

Is web scraping legal?

It depends. Whether scraping is allowed depends on the website, the data you collect, and the laws that apply to you. Many websites do not want you to scrape them, while others are happy for you to do so. Check the terms of the website you are scraping, and ask the owner for permission when in doubt.

Who developed requests_html?

requests-html is a Python library developed by Kenneth Reitz (kennethreitz).


Errors and debugging

No module named 'requests_html'

If you are facing this error, it means that the requests-html library is not installed in the Python environment you are running. Use the pip command to install requests-html.
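
A quick way to check whether a module can be imported in the current environment is importlib from the standard library. This is a small sketch, and is_installed() is an illustrative helper name. Note that the package is installed as requests-html (with a hyphen) but imported as requests_html (with an underscore).

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter shown here is the one that needs the package installed.
print("Python executable:", sys.executable)
print("requests_html available:", is_installed("requests_html"))
```

If the check prints False even though you have installed the library, you most likely installed it for a different Python interpreter. Running python -m pip install requests-html with the same interpreter usually resolves it.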
