Web Scraping Using Python

PRACTICAL_1

Kaaviya Modi
4 min read · Jul 26, 2021

Introduction

The internet is an absolutely massive source of data — data that we can access using web scraping and Python!

In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn’t available in convenient CSV exports or easy-to-connect APIs. And websites themselves are often valuable sources of data — consider, for example, the kinds of analysis you could do if you could download every post on a web forum.

To access those sorts of on-page datasets, we’ll have to use web scraping.

What Is Web Scraping?

Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

If you’re scraping a page respectfully for educational purposes, then you’re unlikely to have any problems. Still, it’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project.
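If you want to check programmatically, Python's standard library can read a site's robots.txt and tell you whether a given path may be crawled. Here is a minimal sketch (example.com is just a placeholder for the site you have in mind):

from urllib.robotparser import RobotFileParser

# Placeholder site; point this at the robots.txt of the site you plan to scrape
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the "*" (any) user agent is allowed to fetch this path
print(rp.can_fetch("*", "https://example.com/search?q=phone"))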

How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.

So far, we’re essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.

But unlike a web browser, our web scraping code won’t interpret the page’s source code and display the page visually. Instead, we’ll write custom code that filters through the page’s source code, looking for the elements we’ve specified and extracting whatever content we’ve instructed it to extract.

For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:

  1. Request the content (source code) of a specific URL from the server
  2. Download the content that is returned
  3. Identify the elements of the page that are part of the table we want
  4. Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.

If that all sounds very complicated, don’t worry! Python and Beautiful Soup have built-in features designed to make this relatively straightforward.
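To make those four steps concrete, here is a rough sketch of scraping a table with requests and Beautiful Soup (the URL and the "data" id are hypothetical):

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: request the page and download its content
response = requests.get("https://example.com/table-page")
soup = BeautifulSoup(response.content, "html.parser")

# Step 3: identify the table we want (assumes such a table exists on the page)
table = soup.find("table", {"id": "data"})

# Step 4: extract each row's cells into a list of lists we can analyze
rows = []
for tr in table.find_all("tr"):
    rows.append([td.text.strip() for td in tr.find_all("td")])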

One thing that’s important to note: from a server’s perspective, requesting a page via web scraping is the same as loading it in a web browser. When we use code to submit these requests, we might be “loading” pages much faster than a regular user, and thus quickly eating up the website owner’s server resources.
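A simple courtesy is to pause between requests so you don't flood the server. A minimal sketch, assuming a hypothetical list of URLs to fetch:

import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

for url in urls:
    response = requests.get(url)
    # ... process the response here ...
    time.sleep(1)  # wait one second between requests to go easy on the server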

Web Scraping Example

  1. Find the URL that you want to scrape

In this blog, we are going to scrape the Flipkart website for product info.

https://www.flipkart.com/search?q=phone&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off

We will scrape data from the URL above.

2. Inspect the page and find the data you want

Right-click anywhere on the page and choose Inspect.

Find the tag that contains the data you want to scrape.

3. Write the code

Import the libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Store the webpage content using requests and BeautifulSoup:

url = "https://www.flipkart.com/search?q=phone&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"
r = requests.get(url)
content = BeautifulSoup(r.content, 'html.parser')
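
Before parsing, it's worth confirming that the request actually succeeded. requests provides raise_for_status() for exactly this:

# Throws an HTTPError for 4xx/5xx responses; does nothing on success
r.raise_for_status()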

Separate the specific data from the content using the respective tags and class names, and then store that data in lists:

products = []
ratings = []
prices = []

# Note: Flipkart's generated class names (like "s1Q9rs") change over time;
# re-check them in the browser inspector if these calls return empty lists
name = content.find_all('a', {"class": "s1Q9rs"})
rate = content.find_all('div', {"class": "_3LWZlK"})
price = content.find_all('div', {"class": "_30jeq3"})

# Pull the text out of each matched tag
for i in name:
    products.append(i.text)
for i in range(len(products)):
    ratings.append(rate[i].text)
for i in range(len(products)):
    prices.append(price[i].text)
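
One caveat: if the page returns fewer ratings or prices than product names, rate[i] or price[i] will raise an IndexError. A defensive variant (same idea, just trimmed to the shortest result list) could look like this:

# Trim everything to the shortest list so indexing can't go out of range
n = min(len(name), len(rate), len(price))
products = [tag.text for tag in name[:n]]
ratings = [tag.text for tag in rate[:n]]
prices = [tag.text for tag in price[:n]]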

We can print the data or store it in a CSV file for further use:

df = pd.DataFrame({'Product_Name': products, 'Price': prices, 'Rating': ratings})
print(df)
df.to_csv('products.csv', index=False, encoding='utf-8')
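
To double-check the export, you can read the file straight back with pandas:

# Load the saved CSV into a new DataFrame and preview the first rows
saved = pd.read_csv('products.csv')
print(saved.head())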

Output Data

The scraped results are saved in the products.csv file.

For the entire code, check my GitHub.

Thank You!

