Web Scraping Basics¶
This notebook demonstrates web scraping and data extraction using Python.
Hands-on Lab : Web Scraping¶
Estimated time needed: 30 to 45 minutes
Objectives¶
In this lab you will perform the following:
- Extract information from a given web site
- Write the scraped data into a CSV file
Extract information from the given web site¶
You will extract the data from the website below:
In [22]:
#this url contains the data you need to scrape
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"
The data you need to scrape is the name of the programming language and average annual salary.
It is a good idea to open the URL in your web browser and study the contents of the web page before you start to scrape.
Import the required libraries
In [23]:
# Your code here
from bs4 import BeautifulSoup # this module helps in web scraping
import requests # this module helps us to download a web page
Download the webpage at the url
In [24]:
#your code goes here
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text
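The one-line download above assumes the request succeeds. A slightly more defensive variant is sketched below: it adds a timeout and raises an exception on HTTP error codes. The helper name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions made for illustration and are not part of the lab.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper. `session` may be a
    requests.Session or any object with a compatible .get() method.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper defined, the cell above could be written as `data = fetch_html(url)`.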
Create a soup object
In [25]:
#your code goes here
# soup = BeautifulSoup(data,"html5lib") # create a soup object using the variable 'data'
soup = BeautifulSoup(data,"html.parser") # create a soup object using the variable 'data'
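To see what a soup object offers before applying it to the real page, the following sketch queries a tiny inline table. The HTML snippet and the variable names are hypothetical and only illustrate the `find`, `find_all`, and `get_text` calls.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, used only to illustrate the calls.
sample_html = """
<table>
  <tr><td>1</td><td>Python</td></tr>
  <tr><td>2</td><td>Java</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

first_table = sample_soup.find("table")  # first <table> tag, or None if absent
rows = first_table.find_all("tr")        # list of every <tr> tag inside it

# get_text() returns the text inside a tag; getText() is an alias for it
names = [row.find_all("td")[1].get_text(strip=True) for row in rows]
print(names)  # ['Python', 'Java']
```

Cells are indexed from zero, so `[1]` selects the second `<td>` in each row.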
Scrape the Language name and annual average salary.
In [26]:
#your code goes here
#find a html table in the web page
table = soup.find('table') # in html, a table is represented by the tag <table>
l = [] # language names, saved later to the csv file
s = [] # average annual salaries, saved later to the csv file
#Get all rows from the table
for row in table.find_all('tr'): # in html, a table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html, a column is represented by the tag <td>
    language = cols[1].getText() # store the value in column 2 (index 1) as language
    l.append(language)
    average_annual_salary = cols[3].getText() # store the value in column 4 (index 3) as average annual salary
    s.append(average_annual_salary)
    print("{}--->{}".format(language,average_annual_salary))
Language--->Average Annual Salary
Python--->$114,383
Java--->$101,013
R--->$92,037
Javascript--->$110,981
Swift--->$130,801
C++--->$113,865
C#--->$88,726
PHP--->$84,727
SQL--->$84,793
Go--->$94,082
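The salaries in the output above are strings such as `$114,383`, so they cannot be compared or averaged directly. The sketch below shows one way to convert them to integers. The helper `parse_salary` is hypothetical, and the three sample pairs are copied from the output above.

```python
def parse_salary(text):
    """Convert a string such as '$114,383' to the integer 114383."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


# Values copied from the scraped output above
scraped = [("Python", "$114,383"), ("Swift", "$130,801"), ("PHP", "$84,727")]
salaries = {language: parse_salary(salary) for language, salary in scraped}

# Find the language with the highest salary among the samples
highest = max(salaries, key=salaries.get)
print(highest, salaries[highest])  # Swift 130801
```

The header row (`Average Annual Salary`) is not numeric, so it would need to be skipped before calling `parse_salary` on the full list.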
Save the scraped data into a file named popular-languages.csv
In [27]:
# your code goes here
# from openpyxl import Workbook # import Workbook class from module openpyxl
# wb=Workbook() # create a workbook object
# ws=wb.active # use the active worksheet
# for i in range(len(l)):
#     ws.append([l[i],s[i]]) # add a row with two columns 'language' and 'salary' value
# wb.save("popular-languages.csv") # save the workbook into a file called popular-languages.csv
# print("Successfully Saved")
Successfully Saved
In [29]:
import csv
with open('popular-languages.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # write one row per language: [language, average annual salary]
    for language, salary in zip(l, s):
        writer.writerow([language, salary])
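To confirm what was written, the file can be read back with `csv.reader`. The sketch below is standalone: it writes three sample rows taken from the scraped output to a temporary directory (an assumption, so that the real file is not overwritten) and reads them back. In the notebook you would simply open `popular-languages.csv`.

```python
import csv
import os
import tempfile

# Sample rows taken from the scraped output; the temporary path is an
# assumption so that the sketch does not overwrite the real file.
rows = [
    ["Language", "Average Annual Salary"],
    ["Python", "$114,383"],
    ["Java", "$101,013"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "popular-languages.csv")

    with open(path, "w", newline="") as file:
        csv.writer(file).writerows(rows)

    # Read the file back; csv.reader yields each row as a list of strings
    with open(path, newline="") as file:
        read_back = list(csv.reader(file))

print(read_back[1])  # ['Python', '$114,383']
```

Because the salary values contain commas, `csv.writer` wraps them in quotes in the file, and `csv.reader` removes the quotes again when reading.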