Web Scraping Review

This notebook reviews and practices web scraping techniques using Python.

Web Scraping Lab

Estimated time needed: 30 minutes

Objectives

After completing this lab, you will be able to:

  • Download a web page using the requests module
  • Scrape all links from a web page
  • Scrape all image URLs from a web page
  • Scrape data from HTML tables

Scrape www.ibm.com

Import the required modules and functions

In [1]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page

Download the contents of the web page

In [2]:
url = "http://www.ibm.com"
In [3]:
# Get the contents of the web page as text and store it in a variable called data
data = requests.get(url).text

Create a soup object using the class BeautifulSoup

In [6]:
# soup = BeautifulSoup(data, "html5lib")  # alternative parser; requires the html5lib package
soup = BeautifulSoup(data, "html.parser")  # create a soup object from 'data' using Python's built-in parser

Scrape all links

In [7]:
for link in soup.find_all('a'):  # in HTML, an anchor (link) is represented by the <a> tag
    print(link.get('href'))
https://www.ibm.com/bd/en
https://www.ibm.com/sitemap/bd/en
https://www.ibm.com/in-en/analytics/data-fabric?lnk=hpv18l1
https://newsroom.ibm.com/Update-on-our-actions-War-in-Ukraine?lnk=ushpv18nf1
https://www.ibm.com/in-en/about/secure-your-business?lnk=hpv18f1
https://www.ibm.com/it-infrastructure/us-en/resources/hybrid-multicloud-infrastructure-strategy/?lnk=hpv18f3
https://www.ibm.com/in-en/cloud/aiops/?lnk=hpv18f4
http://ibm.com/in-en/cloud/campaign/cloud-simplicity/?lnk=hpv18f5
/products/offers-and-discounts?lnk=hpv18t5
/in-en/products/planning-analytics?lnk=hpv18t4&psrc=NONE&lnk2=trial_AsperaCloud&pexp=DEF
/in-en/products/maximo?lnk=hpv18t1&psrc=NONE&pexp=DEF&lnk2=maas360
/in-en/qradar?lnk=hpv18t2&psrc=NONE&lnk2=trial_Qradar&pexp=DEF
/in-en/products/cloud-pak-for-data?lnk=hpv18t3&psrc=NONE&pexp=DEF&lnk2=trial_CloudPakData
/in-en/cloud/free?lnk=hpv18t4&psrc=NONE&pexp=DEF&lnk2=trial_Cloud
/in-en/cloud/watson-assistant?lnk=hpv18t4&psrc=NONE&lnk2=trial_AsperaCloud&pexp=DEF
https://developer.ibm.com/?lnk=hpv18pd1
https://developer.ibm.com/depmodels/cloud/?lnk=hpv18pd2
https://developer.ibm.com/technologies/artificial-intelligence?lnk=hpv18pd3
https://developer.ibm.com/articles?lnk=hpv18pd4
https://www.ibm.com/docs/en?lnk=hpv18pd5
https://www.ibm.com/training/?lnk=hpv18pd6
https://developer.ibm.com/patterns/?lnk=hpv18pd7
https://developer.ibm.com/tutorials/?lnk=hpv18pd8
https://www.redbooks.ibm.com/?lnk=hpv18pd9
https://www.ibm.com/support/home/?lnk=hpv18pd10
/in-en/analytics?lnk=hpv18pb1
/in-en/storage?lnk=hpv18pb2
/in-en/security?lnk=hpv18pb3
/in-en/consulting?lnk=hpv18pb4
/in-en/cloud/hybrid?lnk=hpv18pb5
/in-en/watson?lnk=hpv18pb6
/in-en/garage?lnk=hpv18pb7
/in-en/blockchain?lnk=hpv18pb8
https://www.ibm.com/thought-leadership/institute-business-value/?lnk=hpv18pb9
/in-en/financing?lnk=hpv18pb10
/in-en/cloud/redhat?lnk=hpv18pt1
/in-en/cloud/automation?lnk=hpv18pt2
/in-en/cloud/satellite?lnk=hpv18pt3
/in-en/security/zero-trust?lnk=hpv18pt4
/in-en/it-infrastructure?lnk=hpv18pt5
https://www.ibm.com/quantum-computing?lnk=hpv18pt6
/in-en/cloud/learn/kubernetes?lnk=hpv18pt7
/in-en/products/spss-statistics?lnk=ushpv18pt8
/in-en/blockchain?lnk=hpv18pt9
https://www.ibm.com/in-en/employment?lnk=hpv18pt10
https://www.ibm.com/case-studies/genus-power-infrastructures/?lnk=hpv18cs1
/case-studies/search?lnk=hpv18cs2
#
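
The output above mixes absolute URLs, site-relative paths such as /in-en/analytics, and a bare "#" fragment. A minimal sketch of one way to normalise them with the standard library's urljoin follows. The helper name resolve_links and the short hrefs list are illustrative assumptions, and the base is assumed to be the URL that was requested; if the server redirected, the final page URL would be the better base.

```python
from urllib.parse import urljoin

base_url = "http://www.ibm.com"  # assumption: the page the hrefs were scraped from

# A few href values copied from the output above, plus None,
# which link.get('href') returns for an <a> tag without an href.
hrefs = [
    "https://www.ibm.com/bd/en",
    "/in-en/analytics?lnk=hpv18pb1",
    "#",
    None,
]

def resolve_links(base, hrefs):
    """Return absolute URLs, skipping missing hrefs and bare fragments."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # nothing useful to follow
        resolved.append(urljoin(base, href))  # absolute URLs pass through unchanged
    return resolved

absolute_links = resolve_links(base_url, hrefs)
for link in absolute_links:
    print(link)
```

In the notebook, the hrefs list could instead be built from soup.find_all('a').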

Scrape all images

In [8]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the <img> tag
    print(img.get('src'))
//1.cms.s81c.com/sites/default/files/2022-05-16/secure.jpeg
//1.cms.s81c.com/sites/default/files/2022-04-04/Original-20220316-26479-Forrester-Modernize-444x320.jpg
//1.cms.s81c.com/sites/default/files/2022-04-27/20220425-ls-automation-mobile-720x360-674x674_0.jpeg
//1.cms.s81c.com/sites/default/files/2022-05-16/Infra.jpeg
//1.cms.s81c.com/sites/default/files/2022-05-08/Planning-Analytics-22201-700x420.Original_0.png
//1.cms.s81c.com/sites/default/files/2022-03-28/Maximo.png
//1.cms.s81c.com/sites/default/files/2021-10-25/QRadar-on-Cloud-21400-700x420.png
//1.cms.s81c.com/sites/default/files/2021-04-07/cloud-pak-for-data-trial.png
//1.cms.s81c.com/sites/default/files/2021-04-07/ibm-cloud-trial.png
//1.cms.s81c.com/sites/default/files/2021-08-17/Watson-Assistant-23212-700x420.png
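
The src values above start with "//", which makes them protocol-relative: they reuse the scheme of the page that references them. The sketch below shows one possible way to turn such a value into a full URL and to pull out the file name, using only the standard library. The helper name image_info is hypothetical, and the page URL is assumed to be the one requested earlier.

```python
import posixpath
from urllib.parse import urljoin, urlparse

page_url = "http://www.ibm.com"  # assumption: the page the <img> tags came from

# Two src values copied from the output above
srcs = [
    "//1.cms.s81c.com/sites/default/files/2022-05-16/secure.jpeg",
    "//1.cms.s81c.com/sites/default/files/2022-03-28/Maximo.png",
]

def image_info(page_url, src):
    """Return (absolute URL, file name) for an image src value."""
    absolute = urljoin(page_url, src)  # fills in the scheme from page_url
    filename = posixpath.basename(urlparse(absolute).path)
    return absolute, filename

for src in srcs:
    absolute, filename = image_info(page_url, src)
    print(filename, absolute)
```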

Scrape data from HTML tables

In [9]:
# The URL below contains an HTML table with data about colors and color codes.
In [ ]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before scraping a website, examine its contents and how the data is organized. Open the URL above in your browser and check how many rows and columns the color table has.

In [11]:
# Get the contents of the web page as text and store it in a variable called data
data = requests.get(url).text
In [13]:
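# soup = BeautifulSoup(data, "html5lib")  # alternative parser; requires the html5lib package
soup = BeautifulSoup(data, "html.parser")  # create a soup object from 'data' using Python's built-in parser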
In [14]:
# Find the first HTML table in the web page
table = soup.find('table')  # in HTML, a table is represented by the <table> tag
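
The row and column counts seen in the browser can also be cross-checked in code. The sketch below runs on a small made-up table rather than the real page, so the cell contents and the first two column headings are assumptions; only the idea of counting <tr> rows and the cells inside them carries over. It uses separate variable names so that the notebook's soup and table are left untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the color table (not the real page contents)
sample_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")

rows = sample_table.find_all("tr")
n_rows = len(rows)
# Count both <td> and <th> cells so header rows are measured too
n_cols = max(len(row.find_all(["td", "th"])) for row in rows)

print("rows:", n_rows, "columns:", n_cols)
```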
In [15]:
# Get all rows from the table
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    # Get all cells in the row
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    if len(cols) < 4:  # skip rows without enough cells (e.g. a header row built from <th>)
        continue
    color_name = cols[2].get_text()  # third column: color name
    color_code = cols[3].get_text()  # fourth column: hex color code
    print("{}--->{}".format(color_name, color_code))
Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
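
The header row of the output shows that the codes use the #RRGGBB format: two hexadecimal digits each for red, green, and blue. As a small worked example, the sketch below decodes a few of the scraped codes into integer RGB triples. The helper name hex_to_rgb and the scraped dictionary are illustrative assumptions; in the notebook the pairs would come from the loop above.

```python
def hex_to_rgb(code):
    """Convert a '#RRGGBB' string into an (R, G, B) tuple of ints."""
    digits = code.strip().lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected a code of the form #RRGGBB, got %r" % code)
    # Read the string two hex digits at a time: RR, GG, BB
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

# A few name/code pairs copied from the output above
scraped = {
    "lightsalmon": "#FFA07A",
    "gold": "#FFD700",
    "green": "#008000",
}

rgb_values = {name: hex_to_rgb(code) for name, code in scraped.items()}
for name, rgb in rgb_values.items():
    print(name, rgb)
```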