Web Scraping Review
This notebook reviews and practices web scraping techniques using Python.
Web Scraping Lab
Estimated time needed: 30 minutes
Objectives
After completing this lab you will be able to:
- Download a web page using the requests module
- Scrape all links from a web page
- Scrape all image URLs from a web page
- Scrape data from HTML tables
Scrape www.ibm.com
Import the required modules and functions
In [1]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page
Download the contents of the web page
In [2]:
url = "http://www.ibm.com"
In [3]:
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text
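The one-line download above neither sets a timeout nor checks the HTTP status, so a slow server or an error page would go unnoticed. A minimal sketch of a more defensive variant follows; the helper name `fetch_page` and the 10-second default are assumptions for illustration, not part of the lab.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper, not from the lab)."""
    # The timeout keeps the call from hanging indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out here):
# data = fetch_page("http://www.ibm.com")
```

With this helper, `data = fetch_page(url)` could stand in for the cell above, at the cost of an exception when the request fails.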
Create a soup object using the class BeautifulSoup
In [6]:
# soup = BeautifulSoup(data,"html5lib") # create a soup object using the variable 'data'
soup = BeautifulSoup(data,"html.parser") # create a soup object using the variable 'data'
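To see what a soup object offers without touching the network, the same constructor can be tried on a small inline document. The HTML string and the names `sample_html` and `sample_soup` below are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for this illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/docs">Docs</a>
    <a href="https://example.com/about">About</a>
    <img src="//cdn.example.com/logo.png">
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string          # text inside <title>
anchor_count = len(sample_soup.find_all("a"))  # number of <a> tags

print(page_title)    # Sample page
print(anchor_count)  # 2
```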
Scrape all links
In [7]:
for link in soup.find_all('a'):  # in HTML, an anchor/link is represented by the <a> tag
    print(link.get('href'))
https://www.ibm.com/bd/en
https://www.ibm.com/sitemap/bd/en
https://www.ibm.com/in-en/analytics/data-fabric?lnk=hpv18l1
https://newsroom.ibm.com/Update-on-our-actions-War-in-Ukraine?lnk=ushpv18nf1
https://www.ibm.com/in-en/about/secure-your-business?lnk=hpv18f1
https://www.ibm.com/it-infrastructure/us-en/resources/hybrid-multicloud-infrastructure-strategy/?lnk=hpv18f3
https://www.ibm.com/in-en/cloud/aiops/?lnk=hpv18f4
http://ibm.com/in-en/cloud/campaign/cloud-simplicity/?lnk=hpv18f5
/products/offers-and-discounts?lnk=hpv18t5
/in-en/products/planning-analytics?lnk=hpv18t4&psrc=NONE&lnk2=trial_AsperaCloud&pexp=DEF
/in-en/products/maximo?lnk=hpv18t1&psrc=NONE&pexp=DEF&lnk2=maas360
/in-en/qradar?lnk=hpv18t2&psrc=NONE&lnk2=trial_Qradar&pexp=DEF
/in-en/products/cloud-pak-for-data?lnk=hpv18t3&psrc=NONE&pexp=DEF&lnk2=trial_CloudPakData
/in-en/cloud/free?lnk=hpv18t4&psrc=NONE&pexp=DEF&lnk2=trial_Cloud
/in-en/cloud/watson-assistant?lnk=hpv18t4&psrc=NONE&lnk2=trial_AsperaCloud&pexp=DEF
https://developer.ibm.com/?lnk=hpv18pd1
https://developer.ibm.com/depmodels/cloud/?lnk=hpv18pd2
https://developer.ibm.com/technologies/artificial-intelligence?lnk=hpv18pd3
https://developer.ibm.com/articles?lnk=hpv18pd4
https://www.ibm.com/docs/en?lnk=hpv18pd5
https://www.ibm.com/training/?lnk=hpv18pd6
https://developer.ibm.com/patterns/?lnk=hpv18pd7
https://developer.ibm.com/tutorials/?lnk=hpv18pd8
https://www.redbooks.ibm.com/?lnk=hpv18pd9
https://www.ibm.com/support/home/?lnk=hpv18pd10
/in-en/analytics?lnk=hpv18pb1
/in-en/storage?lnk=hpv18pb2
/in-en/security?lnk=hpv18pb3
/in-en/consulting?lnk=hpv18pb4
/in-en/cloud/hybrid?lnk=hpv18pb5
/in-en/watson?lnk=hpv18pb6
/in-en/garage?lnk=hpv18pb7
/in-en/blockchain?lnk=hpv18pb8
https://www.ibm.com/thought-leadership/institute-business-value/?lnk=hpv18pb9
/in-en/financing?lnk=hpv18pb10
/in-en/cloud/redhat?lnk=hpv18pt1
/in-en/cloud/automation?lnk=hpv18pt2
/in-en/cloud/satellite?lnk=hpv18pt3
/in-en/security/zero-trust?lnk=hpv18pt4
/in-en/it-infrastructure?lnk=hpv18pt5
https://www.ibm.com/quantum-computing?lnk=hpv18pt6
/in-en/cloud/learn/kubernetes?lnk=hpv18pt7
/in-en/products/spss-statistics?lnk=ushpv18pt8
/in-en/blockchain?lnk=hpv18pt9
https://www.ibm.com/in-en/employment?lnk=hpv18pt10
https://www.ibm.com/case-studies/genus-power-infrastructures/?lnk=hpv18cs1
/case-studies/search?lnk=hpv18cs2
#
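The output above mixes absolute URLs, site-relative paths such as /in-en/analytics?lnk=hpv18pb1, and a bare # placeholder. If absolute URLs are needed, one possible approach is to resolve each href against the page URL with `urllib.parse.urljoin` from the standard library. The helper name `absolute_links` and the inline sample HTML are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return absolute URLs for every usable <a href> in `html` (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip()
        # Skip empty values and in-page fragments such as "#".
        if not href or href.startswith("#"):
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        links.append(urljoin(base_url, href))
    return links


# Made-up snippet that reuses a few href values from the output above.
sample = (
    '<a href="/in-en/analytics?lnk=hpv18pb1">Analytics</a>'
    '<a href="#">Top</a>'
    '<a>No href</a>'
    '<a href="https://developer.ibm.com/?lnk=hpv18pd1">Developer</a>'
)

for link in absolute_links(sample, "https://www.ibm.com"):
    print(link)
# https://www.ibm.com/in-en/analytics?lnk=hpv18pb1
# https://developer.ibm.com/?lnk=hpv18pd1
```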
Scrape all images
In [8]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the <img> tag
    print(img.get('src'))
//1.cms.s81c.com/sites/default/files/2022-05-16/secure.jpeg
//1.cms.s81c.com/sites/default/files/2022-04-04/Original-20220316-26479-Forrester-Modernize-444x320.jpg
//1.cms.s81c.com/sites/default/files/2022-04-27/20220425-ls-automation-mobile-720x360-674x674_0.jpeg
//1.cms.s81c.com/sites/default/files/2022-05-16/Infra.jpeg
//1.cms.s81c.com/sites/default/files/2022-05-08/Planning-Analytics-22201-700x420.Original_0.png
//1.cms.s81c.com/sites/default/files/2022-03-28/Maximo.png
//1.cms.s81c.com/sites/default/files/2021-10-25/QRadar-on-Cloud-21400-700x420.png
//1.cms.s81c.com/sites/default/files/2021-04-07/cloud-pak-for-data-trial.png
//1.cms.s81c.com/sites/default/files/2021-04-07/ibm-cloud-trial.png
//1.cms.s81c.com/sites/default/files/2021-08-17/Watson-Assistant-23212-700x420.png
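The image sources above start with // and carry no scheme, so they cannot be passed to `requests.get` as they are. A minimal sketch of one way to complete them is shown below, using only the standard library; the helper name `resolve_src` is an assumption for illustration. `urljoin` borrows the scheme from the page URL, and an `<img>` tag without a `src` attribute (for which `get('src')` returns `None`) is passed through as `None`.

```python
from urllib.parse import urljoin


def resolve_src(page_url, src):
    """Turn an <img> src value into an absolute URL, or None if it is missing."""
    if not src:
        return None
    # For a scheme-less "//host/path" value, urljoin reuses the scheme of page_url.
    return urljoin(page_url, src.strip())


page_url = "http://www.ibm.com"
print(resolve_src(page_url, "//1.cms.s81c.com/sites/default/files/2022-03-28/Maximo.png"))
# http://1.cms.s81c.com/sites/default/files/2022-03-28/Maximo.png
print(resolve_src(page_url, None))
# None
```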
Scrape data from HTML tables
In [9]:
# The URL below points to a page containing an HTML table with data about colors and color codes.
In [ ]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
Before scraping a website, examine its contents and the way the data is organized. Open the above URL in your browser and check how many rows and columns the color table has.
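The same inspection can also be approximated in code. The sketch below counts rows and columns on a small inline table; the table contents are a hypothetical stand-in and only mimic the layout assumed later in this lab (color name in the third cell, hex code in the fourth), not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the color table; the real page has more rows and may differ.
sample_table_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_table_html, "html.parser").find("table")
sample_rows = sample_table.find_all("tr")

row_count = len(sample_rows)
# Count both <td> and <th> cells and take the widest row as the column count.
column_count = max(len(row.find_all(["td", "th"])) for row in sample_rows)

print(row_count, column_count)  # 3 4
```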
In [11]:
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text
In [13]:
# soup = BeautifulSoup(data,"html5lib")
soup = BeautifulSoup(data,"html.parser")
In [14]:
# find the first HTML table in the web page
table = soup.find('table')  # in HTML, a table is represented by the <table> tag
In [15]:
# Get all rows from the table
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    # Get all cells (columns) in each row.
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    color_name = cols[2].getText()  # store the value in column 3 as color_name
    color_code = cols[3].getText()  # store the value in column 4 as color_code
    print("{}--->{}".format(color_name, color_code))
Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
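The first line of the output above is the table's header row, and the loop would raise an IndexError on any row with fewer than four cells. A possible refinement, sketched under the assumption that the first `<tr>` is always the header, skips that row, ignores short rows, and collects the result in a dictionary. The helper name `parse_color_table` and the inline sample table are made up for illustration.

```python
from bs4 import BeautifulSoup


def parse_color_table(table):
    """Map color name -> hex code, assuming the first <tr> is a header row."""
    colors = {}
    for row in table.find_all("tr")[1:]:  # [1:] skips the assumed header row
        cols = row.find_all("td")
        if len(cols) < 4:
            continue  # ignore rows that lack the name/code cells
        # strip=True removes surrounding whitespace and newlines from the cell text.
        name = cols[2].get_text(strip=True)
        code = cols[3].get_text(strip=True)
        colors[name] = code
    return colors


# Hypothetical miniature of the color table, including one malformed row.
demo_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td> lightsalmon </td><td>#FFA07A</td></tr>
  <tr><td>broken row</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""

demo_table = BeautifulSoup(demo_html, "html.parser").find("table")
print(parse_color_table(demo_table))
# {'lightsalmon': '#FFA07A', 'salmon': '#FA8072'}
```

A dictionary keeps later lookups simple (for example `colors["salmon"]`), though a list of tuples would preserve duplicates if the real table repeated a color name.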