Web Scraping with Python for Beginners



Introduction

Web scraping means extracting data from a web page. It can be done manually or with an automated tool. Here we will do it with a Python script and the BeautifulSoup library. Web scraping serves many purposes. It can be used to collect people’s contact information, and it is sometimes used to scrape product prices and details from popular online shopping sites. The range of applications is huge. Today’s example has no commercial purpose.

Importing Python libraries

Sometimes we spend a lot of time choosing a good movie to watch based on viewer ratings. My script does this in minutes. It searches a pre-specified website across 5 pages (or however many you want). Each page contains at least 40 movies, so the script extracts the name, release year, and rating of around 200 movies. After that we sort them by rating.

Python’s requests library lets us send HTTP requests from code instead of by hand. We then parse the web page with the BeautifulSoup library. We will track the scraping time with the time library. We will also need the numpy library for array handling and pandas for data handling.
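As a rough illustration of the timing step, the sketch below wraps any function call and reports how long it took. The helper name timed is hypothetical, and the call to sum is only a stand-in for a real scraping run.

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the tutorial this would be the scraping function.
result, elapsed = timed(sum, range(1000))
print(f"finished in {elapsed:.4f} s")
```

time.perf_counter is used here because it is meant for measuring short intervals; time.time would also work for a coarse measurement.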

Starting with single page web scraping

We will scrape the website https://123movies4u.cz/. Click on Movies to open the movies section, then choose Science Fiction under Genre. At the bottom of the page, click on page 2. You can pick any other page or genre you like. Copy the page URL and save it in the “my_url” variable. At first we will set up our script for one page. If it works for one page, we will loop the script over 10 or 20 pages, or as many as we want. We will use the requests library to send an HTTP request and get the web page. The page is then parsed with the BeautifulSoup library, and the HTTP connection is closed. We will visit each movie’s web page separately to scrape its information.
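The fetch-and-parse step might look roughly like the sketch below. The function names fetch_soup and parse_page, the 10-second timeout, and the sample HTML are assumptions made for illustration and are not taken from the original script. The network call is kept inside a function so that the demo at the bottom runs without touching the site.

```python
import requests
from bs4 import BeautifulSoup


def parse_page(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download url and return the parsed page.

    The with-block closes the connection once the body has been read.
    """
    with requests.get(url, timeout=timeout) as response:
        response.raise_for_status()  # stop early on 4xx/5xx responses
        return parse_page(response.text)


# Offline demo on a tiny hand-written page (hypothetical markup).
sample_html = "<html><head><title>Demo page</title></head><body></body></html>"
page_soup = parse_page(sample_html)
print(page_soup.title.string)  # Demo page
```

With a real listing URL copied from the browser, the call would be page_soup = fetch_soup(my_url).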

If we right-click on a movie element and click Inspect, we will see something like the image above. Every movie item on this page is in a ‘div’ tag with the ‘class’ attribute ‘ml-item’. So we can pick all 40 movie elements on this page with the “findAll” function and save them in mov_list. Each of the 40 elements contains the URL of an individual movie’s web page. Now let’s pick the 5th movie element. Indexing starts at zero, so the 5th element’s index is 4, and we can access it as “mov_list[4]”. The link to the movie’s page is stored in the “a” tag. A tag’s attributes can be accessed like dictionary entries, so the movie link is “mov_list[4].a[‘href’]”. This is saved in the “m_url” variable. Now we send an HTTP request to open the movie’s web page. The whole code snippet is given below.
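A minimal sketch of the selection step, run on a small hand-written listing rather than the live page. Only the ‘div’ tag, the ‘ml-item’ class, and the variable names mov_list and m_url come from the text; the sample markup and its links are made up, and with only three items the example picks index 1 instead of 4. In current BeautifulSoup releases find_all is the preferred spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup that mimics the structure described above.
sample_listing = """
<div>
  <div class="ml-item"><a href="/film/alpha">Alpha</a></div>
  <div class="ml-item"><a href="/film/beta">Beta</a></div>
  <div class="ml-item"><a href="/film/gamma">Gamma</a></div>
</div>
"""

page_soup = BeautifulSoup(sample_listing, "html.parser")

# Every movie card is a <div> whose class is "ml-item".
mov_list = page_soup.find_all("div", class_="ml-item")

# Tag attributes are read like dictionary entries.
m_url = mov_list[1].a["href"]
print(len(mov_list), m_url)  # 3 /film/beta
```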

Going inside a movie webpage and scraping

The content of the individual movie web page is saved in a variable named “m_soup”, and the page looks like the image of the individual movie page. From this page we will scrape the movie name, release year, and IMDb rating.

First we will save the movie name. Right-clicking on the large movie title and choosing Inspect shows that the movie name is in an “h3” tag. Extracting the text from the “h3” tag is simple. The movie name is saved in the “movie_name” variable.

Next we will scrape the release year. Inspecting the “release” text shows that the release year is in an “a” tag. But the “a” tag is not unique: there are many “a” tags throughout the page. So we have to go two levels up from that “a” tag to find a unique tag and class. Two levels up we find a “div” tag with the “class” attribute “mvici-right”. We pick that element first, then pick the “a” tag inside it, and from that we read the release year.

Finally we will scrape the movie rating. Inspecting the rating text on the page shows that it is in a “span” tag whose “class” attribute is “imdb-r”. The scraped rating is saved in the variable “rating”. The code snippet is given below.
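The three extractions could be sketched as follows on a hand-written movie page. The tag and class names (“h3”, “mvici-right”, “imdb-r”) and the variables m_soup, movie_name, and rating come from the text; release_year is a name chosen here, and the sample markup and its values are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical movie page: the <a> with the year sits two levels
# below the <div class="mvici-right"> container, as described above.
sample_movie = """
<div>
  <h3>Example Movie</h3>
  <div class="mvici-right">
    <p>Release: <a href="/release/2016">2016</a></p>
    <p>IMDb: <span class="imdb-r">7.4</span></p>
  </div>
</div>
"""

m_soup = BeautifulSoup(sample_movie, "html.parser")

# The title is the text of the first <h3> on the page.
movie_name = m_soup.h3.get_text(strip=True)

# Narrow down to the unique container first, then take its first <a>.
release_year = m_soup.find("div", class_="mvici-right").a.get_text(strip=True)

# The rating lives in a <span> with class "imdb-r".
rating = m_soup.find("span", class_="imdb-r").get_text(strip=True)

print(movie_name, release_year, rating)  # Example Movie 2016 7.4
```

All three values are strings at this point; converting the rating to a number is left for the sorting step.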

Web scraping the whole thing

Our code can now scrape the information for one movie. It is time to put everything together and scrape as much as we want. We will use two for loops. One loop goes over every movie on a single page. The other loop goes through the number of pages we define. In this case I scraped 5 pages. Some movies don’t have a release year or a rating, and in those cases the code would crash. To handle that I added two if-else conditions, one for the release year and one for the rating. The whole code is given below.
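One way the two loops and the missing-value checks might fit together is sketched below. This is not the author’s script: the helper names (text_or_none, scrape_movie, scrape_pages), the fetch_html parameter, and the in-memory fake_site are all assumptions, chosen so the example runs without network access. Missing fields are stored as None instead of raising an error.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the tag's stripped text, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def scrape_movie(m_soup):
    """Extract name, year, and rating; absent fields become None."""
    right = m_soup.find("div", class_="mvici-right")
    year_tag = right.a if right is not None else None
    return {
        "name": text_or_none(m_soup.h3),
        "year": text_or_none(year_tag),
        "rating": text_or_none(m_soup.find("span", class_="imdb-r")),
    }


def scrape_pages(page_urls, fetch_html):
    """Outer loop over listing pages, inner loop over movies on each page."""
    movies = []
    for page_url in page_urls:
        page_soup = BeautifulSoup(fetch_html(page_url), "html.parser")
        for item in page_soup.find_all("div", class_="ml-item"):
            if item.a is None or not item.a.has_attr("href"):
                continue  # card without a link: nothing to visit
            m_soup = BeautifulSoup(fetch_html(item.a["href"]), "html.parser")
            movies.append(scrape_movie(m_soup))
    return movies


# Tiny in-memory "site" standing in for real HTTP responses.
fake_site = {
    "page1": '<div class="ml-item"><a href="m1">A</a></div>'
             '<div class="ml-item"><a href="m2">B</a></div>',
    "m1": '<h3>First</h3><div class="mvici-right"><p><a>2016</a></p></div>'
          '<span class="imdb-r">7.4</span>',
    "m2": "<h3>Second</h3>",  # no release year and no rating
}

movies = scrape_pages(["page1"], fake_site.__getitem__)
print(movies)
```

For a live run, fetch_html would be a function that downloads a URL with requests and returns the response text, and relative links would first need to be joined with the site’s base URL.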

After scraping 200 movies we can see the top 10 movies based on rating.
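The ranking step can be sketched with pandas as below. The three rows are made-up sample data, not scraped results, and the column names are assumptions. Ratings arrive as text, so they are converted to numbers before sorting, and movies without a rating end up at the bottom.

```python
import pandas as pd

# Made-up rows in the shape a scraper might produce.
movies = [
    {"name": "First", "year": "2016", "rating": "7.4"},
    {"name": "Second", "year": None, "rating": None},
    {"name": "Third", "year": "2014", "rating": "8.1"},
]

df = pd.DataFrame(movies)

# Convert rating text to floats; missing or malformed values become NaN.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Highest rating first, unrated movies last, keep at most ten rows.
top10 = df.sort_values("rating", ascending=False, na_position="last").head(10)
print(top10.to_string(index=False))
```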

Concluding remarks

Manually going through 200 movies and collecting 3 attributes for each may take hours or even days. Web scraping saves that time. Furthermore, the commercial demand for this technique is huge. This post is only for fun and as an introductory tutorial on web scraping with Python. In the future I will come up with new web scraping tutorials. The GitHub repository link for the project is given below.

You can also check my Fiverr gig for web scraping.
