Web Scraping Basics

This notebook demonstrates web scraping and data extraction using Python.

Hands-on Lab: Web Scraping

Estimated time needed: 30 to 45 minutes

Objectives

In this lab you will perform the following:

  • Extract information from a given website.
  • Write the scraped data to a CSV file.

Extract information from the given website

You will extract the data from the website below:

In [22]:
#this url contains the data you need to scrape
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

The data you need to scrape is the name of each programming language and its average annual salary.
It is a good idea to open the URL in your web browser and study the contents of the web page before you start to scrape.

Import the required libraries

In [23]:
# Your code here
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page

Download the webpage at the url

In [24]:
#your code goes here
# get the contents of the web page as text and store it in a variable called data
data = requests.get(url).text
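
The download above assumes the request succeeds. As a minimal sketch, the same step can be wrapped so that a failed request raises an error instead of silently returning an error page. The helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, not part of the lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper defined, the cell above could be written as `data = fetch_html(url)`.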

Create a soup object

In [25]:
#your code goes here
# soup = BeautifulSoup(data, "html5lib")  # alternative parser; html5lib must be installed separately
soup = BeautifulSoup(data, "html.parser")  # create a soup object from 'data' using Python's built-in parser
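
Before pointing the soup object at the real page, it can help to try the same calls on a tiny table. The sketch below uses made-up HTML (the column headings and the "Created By" values are assumptions for illustration, not copied from the lab page) and separate variable names so it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page; only its shape matters here.
sample_html = """
<table>
  <tr><td>No.</td><td>Language</td><td>Created By</td><td>Average Annual Salary</td></tr>
  <tr><td>1</td><td>Python</td><td>Guido van Rossum</td><td>$114,383</td></tr>
  <tr><td>2</td><td>Java</td><td>James Gosling</td><td>$101,013</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")  # first <table> in the document

rows = []
for row in sample_table.find_all("tr"):  # one <tr> per table row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) < 4:
        continue  # skip rows that lack the expected four cells
    rows.append((cells[1], cells[3]))  # second and fourth cells

print(rows)
```

The length check is a defensive choice: a row without four `<td>` cells would otherwise raise an `IndexError` when indexed.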

Scrape the Language name and annual average salary.

In [26]:
#your code goes here

# find the first HTML table on the web page
table = soup.find('table')  # an HTML table is represented by the <table> tag
l = []  # language names, collected for the CSV file
s = []  # average annual salaries, collected for the CSV file
# get all rows from the table
for row in table.find_all('tr'):  # a table row is represented by the <tr> tag
    # get all cells in the row
    cols = row.find_all('td')  # a table cell is represented by the <td> tag
    language = cols[1].getText()  # the second column holds the language name
    l.append(language)
    average_annual_salary = cols[3].getText()  # the fourth column holds the average annual salary
    s.append(average_annual_salary)
    print("{}--->{}".format(language, average_annual_salary))
Language--->Average Annual Salary
Python--->$114,383
Java--->$101,013
R--->$92,037
Javascript--->$110,981
Swift--->$130,801
C++--->$113,865
C#--->$88,726
PHP--->$84,727
SQL--->$84,793
Go--->$94,082
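
The salaries printed above are strings such as "$114,383". If they are needed as numbers later, a small conversion along these lines is one option. The helper name `parse_salary` is hypothetical, and the sample values are copied from the output above.

```python
def parse_salary(text):
    """Convert a string such as '$114,383' to the integer 114383."""
    # drop the dollar sign and the thousands separators before converting
    return int(text.replace("$", "").replace(",", "").strip())


sample_salaries = ["$114,383", "$101,013", "$92,037"]
numeric = [parse_salary(value) for value in sample_salaries]
print(numeric)  # [114383, 101013, 92037]
```

Note that the first scraped row is the header ("Average Annual Salary"), which is not a number, so it would have to be skipped, for example with `s[1:]`.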

Save the scraped data into a file named popular-languages.csv

In [27]:
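# your code goes here
# Earlier attempt, kept for reference. openpyxl writes an Excel workbook, so the saved
# file would not be plain-text CSV despite its name; the csv module is used in the next cell.
# from openpyxl import Workbook        # import Workbook class from module openpyxl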

# wb=Workbook()                        # create a workbook object
# ws=wb.active                         # use the active worksheet

# for i in range(len(l)):
#     ws.append([l[i],s[i]])     # add a row with two columns 'language' and 'salary' value
# wb.save("popular-languages.csv")            # save the workbook into a file called popular-languages.csv
# print("Successfully Saved")
Successfully Saved
In [29]:
import csv

# write one row per scraped table row (the first row is the header)
with open('popular-languages.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(zip(l, s))  # pair each language with its salary
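
To confirm that the file holds what was written, the rows can be read back with `csv.reader`. The sketch below is self-contained: it writes a few sample rows (taken from the output above) to a temporary directory rather than touching the lab's own file, and the variable names are assumptions for illustration.

```python
import csv
import os
import tempfile

sample_rows = [
    ["Language", "Average Annual Salary"],
    ["Python", "$114,383"],
    ["Java", "$101,013"],
]

# write to a throwaway location so the real popular-languages.csv is untouched
path = os.path.join(tempfile.mkdtemp(), "popular-languages.csv")
with open(path, "w", newline="") as file:
    csv.writer(file).writerows(sample_rows)

# read the rows back; csv.reader undoes the quoting added for values containing commas
with open(path, newline="") as file:
    read_back = list(csv.reader(file))

print(read_back == sample_rows)  # True
```

Because salaries such as "$114,383" contain a comma, the writer wraps them in quotes in the file, and the reader restores the original string.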