My Complete Guide to Web Scraping with Python¶
Author: Mohammad Sayem Chowdhury
A comprehensive journey through the art and science of web scraping using Beautiful Soup
My Personal Introduction to Web Scraping¶
Web scraping has become one of my most valuable skills in data analysis and automation. This notebook represents my hands-on exploration of Beautiful Soup, one of Python's most powerful libraries for extracting data from web pages. Through real-world examples and practical applications, I'll demonstrate how I use web scraping to gather insights from the vast ocean of web data.
My Web Scraping Mastery Workshop¶
Building expertise in data extraction from the web
My Learning Journey Time: Approximately 45-60 minutes for complete mastery
This represents my structured approach to understanding web scraping fundamentals and advanced techniques.
My Learning Objectives¶
By completing this comprehensive workshop, I will have mastered:
- Beautiful Soup Fundamentals: Understanding the core concepts and object hierarchy
- HTML Navigation: Efficiently traversing web page structures
- Data Extraction Techniques: Filtering and extracting specific information
- Real-World Applications: Scraping live websites and handling dynamic content
- Data Processing: Converting scraped data into structured formats like DataFrames
- Best Practices: Ethical scraping approaches and error handling
My Web Scraping Workshop Roadmap¶
📚 Part 1: Beautiful Soup Foundations¶
- Understanding Beautiful Soup Objects
- Working with Tags and Elements
- Navigating Parent-Child Relationships
- HTML Attributes and Properties
- NavigableString Operations
🔍 Part 2: Advanced Filtering Techniques¶
- The Power of the find_all() Method
- Targeted Element Selection with find()
- Attribute-Based Filtering
- String Content Matching
🌐 Part 3: Real-World Web Scraping¶
- Live Website Data Extraction
- Image and Link Collection
- Table Data Scraping
- DataFrame Integration with Pandas
💡 Part 4: Professional Applications¶
- My Personal Scraping Projects
- Data Processing Workflows
- Ethical Considerations and Best Practices
Estimated completion time: 45-60 minutes for thorough understanding
My Learning Approach: Hands-on examples with real-world applications
Skill Level: Beginner to advanced techniques covered
My Development Environment Setup¶
For this comprehensive web scraping workshop, I'll be using several essential Python libraries. Let me prepare my environment with the tools I need for effective web data extraction.
I always ensure my environment is properly configured before diving into any data extraction project.
# My essential web scraping toolkit installation
# The bs4 package on PyPI is a thin wrapper that pulls in beautifulsoup4
!pip install bs4
# Beautiful Soup - my go-to library for HTML/XML parsing
# Requests library is typically pre-installed in most environments
print("My web scraping environment is ready for action!")
Requirement already satisfied: bs4 in e:\anaconda\lib\site-packages (0.0.1) Requirement already satisfied: beautifulsoup4 in e:\anaconda\lib\site-packages (from bs4) (4.9.3) Requirement already satisfied: soupsieve>1.2; python_version >= "3.0" in e:\anaconda\lib\site-packages (from beautifulsoup4->bs4) (2.0.1)
Importing My Web Scraping Arsenal¶
These are the core libraries I rely on for all my web scraping projects:
# My essential web scraping imports
from bs4 import BeautifulSoup # My primary tool for HTML parsing and navigation
import requests # For downloading web page content efficiently
print("My web scraping toolkit is loaded and ready!")
print(f"Beautiful Soup version available for my projects")
print(f"Requests library ready for web communication")
My Understanding of Beautiful Soup¶
Through my extensive work with web scraping, I've found Beautiful Soup to be an incredibly powerful library for extracting data from HTML and XML documents. What makes it special in my toolkit is how it represents web pages as a navigable tree structure, allowing me to efficiently locate and extract exactly the data I need.
Let me demonstrate this with a practical example that I often use in my projects:
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome to My Sample Page</h1>
<p class="description">This is a simple HTML page for testing.</p>
<a href="https://www.example.com">Visit Example.com</a>
</body>
</html>
Using Beautiful Soup, I can easily parse this HTML and extract the title, heading, and link. Here's how the code looks:
from bs4 import BeautifulSoup
# Sample HTML
html_doc = """
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome to My Sample Page</h1>
<p class="description">This is a simple HTML page for testing.</p>
<a href="https://www.example.com">Visit Example.com</a>
</body>
</html>
"""
# Parse the HTML
soup = BeautifulSoup(html_doc, 'html.parser')
# Extract the title
title = soup.title.string
# Extract the heading
heading = soup.h1.string
# Extract the link
link = soup.a['href']
print(f"Title: {title}")
print(f"Heading: {heading}")
print(f"Link: {link}")
This will output:
Title: Sample Page
Heading: Welcome to My Sample Page
Link: https://www.example.com
As you can see, Beautiful Soup makes it incredibly straightforward to navigate and search the parse tree, turning complex HTML into manageable data.
%%html
<!DOCTYPE html>
<html>
<head>
<title>My Data Analysis Projects</title>
</head>
<body>
<h3><b id='primary'>Mohammad Sayem Chowdhury</b></h3>
<p> Primary Skill: Data Analysis & Web Scraping </p>
<h3> Python Programming</h3>
<p> Experience: 5+ years </p>
<h3> Machine Learning </h3>
<p> Specialization: Predictive Analytics</p>
</body>
</html>
Lebron James
Salary: $ 92,000,000
Stephen Curry
Salary: $85,000, 000
Kevin Durant
Salary: $73,200, 000
Storing HTML Content for Analysis¶
In my web scraping projects, I often work with HTML content stored as strings. Let me demonstrate how I handle this:
# My sample HTML content for demonstration
my_profile_html = "<!DOCTYPE html><html><head><title>My Data Analysis Projects</title></head><body><h3><b id='primary'>Mohammad Sayem Chowdhury</b></h3><p> Primary Skill: Data Analysis & Web Scraping </p><h3> Python Programming</h3><p> Experience: 5+ years </p><h3> Machine Learning </h3><p> Specialization: Predictive Analytics</p></body></html>"
print("Sample HTML content prepared for my Beautiful Soup demonstration!")
print(f"Content length: {len(my_profile_html)} characters")
My Beautiful Soup Parsing Process¶
One of the fundamental skills I've developed is creating Beautiful Soup objects from HTML content. The BeautifulSoup constructor transforms raw HTML into a structured, navigable object that I can query and manipulate:
# Creating my Beautiful Soup object for analysis
my_soup = BeautifulSoup(my_profile_html, 'html5lib')
print("My Beautiful Soup object is created and ready for navigation!")
print(f"Document type: {type(my_soup)}")
print("Ready to explore the HTML structure!")
my_soup
<!DOCTYPE html> <html><head><title>Page Title</title></head><body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>
My Understanding of Beautiful Soup Processing¶
In my experience with web scraping, I've learned that Beautiful Soup performs several important transformations:
- Unicode Conversion: All content is standardized to Unicode encoding
- Entity Resolution: HTML entities are converted to readable characters
- Tree Structure: The flat HTML is organized into a hierarchical object tree
This creates a powerful foundation for my data extraction workflows. The main object types I work with are:
- BeautifulSoup objects: The root document container
- Tag objects: Individual HTML elements
- NavigableString objects: Text content within tags
Visualizing HTML Structure with prettify()¶
One of my favorite Beautiful Soup methods is prettify(), which displays HTML in a clean, indented format that makes the structure immediately clear:
print("My HTML structure visualization:")
print(my_soup.prettify())
print("\nThis clean format helps me understand the document hierarchy!")
<!DOCTYPE html>
<html>
<head>
<title>
Page Title
</title>
</head>
<body>
<h3>
<b id="boldest">
Lebron James
</b>
</h3>
<p>
Salary: $ 92,000,000
</p>
<h3>
Stephen Curry
</h3>
<p>
Salary: $85,000, 000
</p>
<h3>
Kevin Durant
</h3>
<p>
Salary: $73,200, 000
</p>
</body>
</html>
My Tag Navigation Expertise¶
Understanding how to work with HTML tags is fundamental to my web scraping success.
Working with Tag Objects in My Projects¶
In my data extraction work, Tag objects are the building blocks of web scraping. Each Tag corresponds to an HTML element, and I can access specific tags directly. For example, when I want to extract the page title or identify key information:
# Extracting the title from my HTML content
my_title_tag = my_soup.title
print("My extracted title tag:", my_title_tag)
print(f"Title content: {my_title_tag.string}")
tag object: <title>Page Title</title>
Understanding Tag Object Types¶
In my analysis work, I always verify the object types I'm working with:
print("My tag object type:", type(my_title_tag))
print("This confirms I'm working with a Beautiful Soup Tag object!")
tag object type: <class 'bs4.element.Tag'>
My Strategy for Multiple Tags¶
When my HTML contains multiple tags with the same name, attribute-style access (for example, my_soup.h3) returns only the first occurrence. This is often exactly what I need when extracting primary information:
# Extracting the first h3 tag (my name in this case)
my_primary_heading = my_soup.h3
print("My primary heading tag:", my_primary_heading)
print(f"This contains my name: {my_primary_heading.get_text()}")
# Set tag_object for consistency with later references
tag_object = my_primary_heading
<h3><b id="boldest">Lebron James</b></h3>
Navigating to Child Elements in My Analysis¶
In my web scraping projects, I frequently need to drill down into nested HTML structures. The bold tag <b> within my h3 element is a perfect example of parent-child relationships that I encounter regularly:
My Mastery of HTML Relationships¶
Understanding parent-child-sibling relationships is crucial for efficient web scraping
My Tree Navigation Techniques¶
In my experience, HTML documents are structured as trees, and Beautiful Soup provides intuitive methods to navigate these relationships. I can move down to children, up to parents, or sideways to siblings:
# My technique for accessing child elements
my_name_child = my_primary_heading.b
print("My extracted name from child element:", my_name_child)
print(f"Clean text: {my_name_child.get_text()}")
# Set tag_child for consistency with later references
tag_child = my_name_child
<b id="boldest">Lebron James</b>
Accessing Parent Elements in My Workflow¶
When I need to move up the HTML hierarchy, I use the parent attribute:
# My method for accessing parent elements
my_parent_element = my_name_child.parent
print("My parent element:", my_parent_element)
print("This brings me back to the h3 tag containing my name")
<h3><b id="boldest">Lebron James</b></h3>
Verifying My Navigation Results¶
I always verify that my navigation returned the expected element:
# Confirming my navigation worked correctly
print("Original primary heading:", my_primary_heading)
print("Does parent navigation match?", my_parent_element == my_primary_heading)
<h3><b id="boldest">Lebron James</b></h3>
Understanding My Document Hierarchy¶
The parent of my h3 tag is the body element, which I can access like this:
# Accessing the body element (parent of my h3)
my_body_parent = my_primary_heading.parent
print("My h3 tag's parent:", my_body_parent.name)
print(f"This is the {my_body_parent.name} element of my document")
<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>
My Sibling Navigation Techniques¶
Sibling elements are at the same level in the HTML hierarchy. I can navigate between them using next_sibling:
# Finding the next sibling of my primary heading
my_first_sibling = my_primary_heading.next_sibling
print("My first sibling element:", my_first_sibling)
print(f"Content: {my_first_sibling.get_text() if hasattr(my_first_sibling, 'get_text') else str(my_first_sibling).strip()}")
# Set sibling_1 for consistency
sibling_1 = my_first_sibling
<p> Salary: $ 92,000,000 </p>
Continuing My Sibling Navigation¶
my_second_sibling represents the next header element, which contains information about my Python programming skills:
# My second sibling navigation
my_second_sibling = sibling_1.next_sibling
print("My second sibling element:", my_second_sibling)
print(f"This contains: {my_second_sibling.get_text() if hasattr(my_second_sibling, 'get_text') else 'navigation text'}")
# Set sibling_2 for consistency
sibling_2 = my_second_sibling
<h3> Stephen Curry</h3>
My Hands-On Practice: Sibling Navigation¶
Let me practice my sibling navigation skills to extract my Python experience information:
Exercise: next_sibling
My Task: Using the my_second_sibling object and the next_sibling property, I'll extract information about my Python programming experience:
# My solution: extracting Python experience information
my_third_sibling = my_second_sibling.next_sibling
print("My third sibling element:", my_third_sibling)
print(f"My Python experience: {my_third_sibling.get_text() if hasattr(my_third_sibling, 'get_text') else str(my_third_sibling).strip()}")
# Alternative approach for more reliable extraction
my_python_info = my_soup.find_all('p')[1] # Second paragraph
print(f"\nDirect extraction of my Python info: {my_python_info.get_text()}")
<p> Salary: $85,000, 000 </p>
My Solution Notes:
I found that sibling navigation can be affected by whitespace in HTML. In my professional work, I often combine sibling navigation with direct element targeting for more reliable results.
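When stray whitespace text nodes do get in the way, Beautiful Soup's find_next_sibling() method skips them and jumps straight to the next matching tag. A minimal sketch on the heading parsed earlier:
# find_next_sibling() ignores intervening NavigableString whitespace and
# returns the next sibling tag with the requested name
next_paragraph = my_primary_heading.find_next_sibling('p')
print("Next <p> sibling:", next_paragraph)
next_heading = my_primary_heading.find_next_sibling('h3')
print("Next <h3> sibling:", next_heading)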
My HTML Attributes Mastery¶
Working with HTML attributes is essential for targeted data extraction
Understanding HTML Attributes in My Projects¶
HTML attributes provide crucial metadata for my web scraping operations. In my example, the id="primary" attribute allows me to uniquely identify and target specific elements. I treat tag attributes like dictionary keys:
# My method for accessing HTML attributes
my_id_value = my_name_child['id']
print(f"My element's ID attribute: {my_id_value}")
print("This allows me to uniquely identify this element in my scraping!")
'boldest'
My Direct Attributes Dictionary Access¶
I can access all attributes at once using the attrs property:
# My technique for accessing all attributes
my_all_attributes = my_name_child.attrs
print(f"All attributes for my name element: {my_all_attributes}")
print(f"This dictionary contains: {list(my_all_attributes.keys())}")
{'id': 'boldest'}
My Notes on Multi-Valued Attributes¶
In my advanced web scraping projects, I sometimes encounter elements with multiple values for a single attribute (like multiple CSS classes). Beautiful Soup handles these elegantly, which I document in my advanced scraping techniques portfolio.
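As a minimal sketch of that behaviour (the snippet below is a made-up example, not part of my profile page), Beautiful Soup returns multi-valued attributes such as class as a Python list:
# A class attribute with several values comes back as a list of strings
multi_html = "<p class='lead highlight intro'>Multi-class paragraph</p>"
multi_soup = BeautifulSoup(multi_html, 'html.parser')
paragraph = multi_soup.p
print(paragraph['class'])      # ['lead', 'highlight', 'intro']
print(paragraph.get('class'))  # the same list via the safer get() accessor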
My Preferred Method: Using get() for Safe Attribute Access¶
I prefer using the get() method because it safely handles missing attributes without throwing errors:
# My safe attribute access method
my_safe_id = tag_child.get('id')
my_missing_attr = tag_child.get('class', 'Not found')
print(f"My ID using get(): {my_safe_id}")
print(f"Missing attribute handling: {my_missing_attr}")
print("The get() method prevents errors in my production scraping scripts!")
'boldest'
Understanding NavigableString in My Workflow¶
NavigableString objects contain the actual text content within HTML tags. In my data extraction projects, this is often the most valuable information I'm seeking. Let me extract my name from the tag:
# My method for extracting text content
my_extracted_name = my_name_child.string
print(f"My extracted name: {my_extracted_name}")
print(f"Type of extracted content: {type(my_extracted_name)}")
'Lebron James'
Verifying My NavigableString Type¶
I always verify the data types I'm working with in my analysis:
# My type verification process
print(f"My NavigableString type: {type(my_extracted_name)}")
print("This confirms I'm working with Beautiful Soup's text container!")
bs4.element.NavigableString
My String Conversion Process¶
NavigableString is similar to Python strings but includes Beautiful Soup functionality. For my data processing pipelines, I often convert to standard Python strings:
# My string conversion technique
my_python_string = str(my_extracted_name)
print(f"My converted string: {my_python_string}")
print(f"Now it's a standard Python string: {type(my_python_string)}")
print("Perfect for integration with my data analysis workflows!")
'Lebron James'
My Filtering Philosophy¶
Filtering is the heart of efficient web scraping. In my projects, I use Beautiful Soup's powerful filtering capabilities to locate complex patterns and extract specific data. Let me demonstrate with a practical example from my project tracking system:
%%html
<table>
<tr>
<td id='project_header'>Project Name</td>
<td>Technology Stack</td>
<td>Completion Rate</td>
</tr>
<tr>
<td>1</td>
<td><a href='https://github.com/mohammadsayem/data-analysis'>Data Analysis Portfolio</a></td>
<td>95%</td>
</tr>
<tr>
<td>2</td>
<td><a href='https://github.com/mohammadsayem/web-scraping'>Web Scraping Toolkit</a></td>
<td>87%</td>
</tr>
<tr>
<td>3</td>
<td><a href='https://github.com/mohammadsayem/machine-learning'>ML Pipeline Framework</a></td>
<td>78%</td>
</tr>
</table>
Storing My Project Data for Analysis¶
I'll store this project tracking table as a string for processing:
table="<table><tr><td id='project_header'>Project Name</td><td>Technology Stack</td> <td>Completion Rate</td></tr><tr> <td>1</td><td><a href='https://github.com/mohammadsayem/data-analysis'>Data Analysis Portfolio</a></td><td>95%</td></tr><tr><td>2</td><td><a href='https://github.com/mohammadsayem/web-scraping'>Web Scraping Toolkit</a></td><td>87%</td></tr><tr><td>3</td><td><a href='https://github.com/mohammadsayem/machine-learning'>ML Pipeline Framework</a></td><td>78%</td></tr></table>"
print("My project data is ready for Beautiful Soup processing!")
from bs4 import BeautifulSoup
# Creating a Beautiful Soup object from my project tracking table (the `table` string above)
table_bs = BeautifulSoup(table, 'html5lib')
# A second soup object over the same project data, used for my project-specific analysis
my_projects_soup = BeautifulSoup(table, 'html5lib')
print("My project table is now a Beautiful Soup object!")
# Displaying the parsed table
table_bs
<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>
My find_all() Mastery¶
The most powerful method in my web scraping arsenal
Understanding find_all() in My Workflow¶
The find_all() method is my go-to tool for comprehensive data extraction. It searches through all descendants of a tag and returns every element that matches my criteria.
My Method Signature:
find_all(name, attrs, recursive, string, limit, **kwargs)
This flexibility allows me to create highly targeted searches for specific data patterns.
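A minimal sketch of the less obvious parameters, run against the project table parsed above (limit caps the number of results; recursive=False restricts the search to direct children, and I'm assuming the html5lib parser, which wraps the rows in a <tbody>):
# limit stops the search after the requested number of matches
first_two_cells = table_bs.find_all('td', limit=2)
print("First two cells:", first_two_cells)
# recursive=False looks only at direct children instead of all descendants;
# with html5lib the <tr> rows sit directly under the generated <tbody>
direct_rows = table_bs.table.tbody.find_all('tr', recursive=False)
print("Rows found as direct children of <tbody>:", len(direct_rows))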
My Name Parameter Mastery¶
Targeting specific HTML tags by name
Using the Name Parameter in My Projects¶
When I set the name parameter to a specific tag name, Beautiful Soup extracts all instances of that tag. This is perfect for my table analysis workflows:
# My technique for extracting all table rows
my_project_rows = my_projects_soup.find_all('tr')
print(f"Found {len(my_project_rows)} rows in my project table")
print("My extracted rows:")
for i, row in enumerate(my_project_rows):
print(f"Row {i}: {row}")
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>]
Working with My Results as an Iterable¶
The find_all() method returns a Python list, where each element is a Tag object. This makes it perfect for my data processing workflows:
# My method for accessing individual rows
my_first_row = my_project_rows[0]
print("My first project row (header):", my_first_row)
print(f"This contains the headers for my project tracking table")
<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
print(f"My row object type: {type(my_first_row)}")
print("Confirmed: This is a Beautiful Soup Tag object!")
<class 'bs4.element.Tag'>
Accessing Child Elements in My Analysis¶
I can drill down into the row structure to access individual cells:
# My method for accessing the first cell in a row
my_first_cell = my_first_row.td
print(f"My first cell content: {my_first_cell}")
print(f"Cell text: {my_first_cell.get_text()}")
<td id="flight">Flight No</td>
My Iterative Analysis Approach¶
I frequently iterate through all rows to analyze the complete dataset:
# My comprehensive row analysis
print("My complete project table analysis:")
for i, row in enumerate(my_project_rows):
row_text = row.get_text()
print(f"Row {i}: {row_text.strip()}")
print(f"Raw HTML: {row}")
print("-" * 50)
row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr> row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr> row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr> row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>
My Advanced Cell Extraction Technique¶
For detailed analysis, I extract individual cells from each row. This allows me to create structured data from HTML tables - a technique I use frequently in my data collection projects:
Because each row is itself a Tag object, I can call find_all('td') on it to collect its cells. The result is a list of Tag objects, one per cell, which I can iterate over and read with get_text() (or the string attribute).
# My detailed cell-by-cell analysis
print("My comprehensive cell extraction:")
for i, row in enumerate(my_project_rows):
print(f"\nAnalyzing row {i}:")
cells = row.find_all('td')
for j, cell in enumerate(cells):
cell_text = cell.get_text().strip()
print(f' Column {j}: "{cell_text}"')
if cell.find('a'): # Check for links
link = cell.find('a')['href']
print(f' -> Contains link: {link}')
row 0 colunm 0 cell <td id="flight">Flight No</td> colunm 1 cell <td>Launch site</td> colunm 2 cell <td>Payload mass</td> row 1 colunm 0 cell <td>1</td> colunm 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td> colunm 2 cell <td>300 kg</td> row 2 colunm 0 cell <td>2</td> colunm 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td> colunm 2 cell <td>94 kg</td> row 3 colunm 0 cell <td>3</td> colunm 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td> colunm 2 cell <td>80 kg</td>
My Multi-Tag Search Technique¶
I can search for multiple tag types simultaneously by providing a list:
list_input = table_bs.find_all(name=["tr", "td"])
print(f"Found {len(list_input)} elements (tr and td combined)")
print("\nFirst few elements:")
for i, element in enumerate(list_input[:5]):
print(f"{i}: {element.name} -> {element.get_text().strip()[:30]}...")
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <td>80 kg</td>]
My Attribute-Based Filtering Mastery¶
Precision targeting using HTML attributes
My Attribute Filtering Strategy¶
In my advanced scraping projects, I often need to find elements with specific attributes. Beautiful Soup automatically converts unrecognized arguments into attribute filters. For example, when I want to find my project header:
When Beautiful Soup does not recognize a keyword argument, it turns it into a filter on tag attributes. Passing id, for example, filters against each tag's id attribute, so I can target the header cell of my project table directly by its id value.
# My ID-based element targeting (table_bs is used here for demonstration)
table_flight_elements = table_bs.find_all(id="flight")  # find_all returns an empty list when nothing matches
print("Flight elements from table:", table_flight_elements)
# My project header targeting
my_project_header = my_projects_soup.find_all(id="project_header")
print("My project header element:", my_project_header)
print(f"Header text: {my_project_header[0].get_text() if my_project_header else 'Not found'}")
[<td id="flight">Flight No</td>]
My Link-Based Filtering Technique¶
I can find all elements that link to specific URLs, which is valuable for analyzing my project portfolio:
# My link-based filtering technique using table_bs
list_input = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")  # empty list if no matching links
print("Wikipedia Florida links:", list_input)
# My technique for finding specific project links
my_data_analysis_links = my_projects_soup.find_all(href="https://github.com/mohammadsayem/data-analysis")
print("My Data Analysis Portfolio links:", my_data_analysis_links)
if my_data_analysis_links:
print(f"Found my project: {my_data_analysis_links[0].get_text()}")
else:
print("No direct matches found - trying broader search...")
# Check for any GitHub links
all_github_links = my_projects_soup.find_all('a', href=lambda href: href and 'github.com' in href)
if all_github_links:
print(f"Found {len(all_github_links)} GitHub links in my projects")
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
My Boolean Attribute Search¶
When I want to find all elements with a specific attribute (regardless of value), I use boolean filtering:
# My technique for finding all elements with href attributes
my_all_links = my_projects_soup.find_all(href=True)
print(f"Found {len(my_all_links)} elements with href attributes")
print("\nMy project links:")
for i, link in enumerate(my_all_links):
print(f"{i+1}. {link.get_text()} -> {link['href']}")
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
Beautiful Soup provides additional methods for working with attributes beyond these; the official documentation covers them in detail.
My Advanced Attribute Techniques¶
For more complex attribute handling and CSS selectors, I refer to my advanced web scraping documentation where I detail sophisticated filtering patterns for enterprise-level data extraction.
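One such technique worth a quick sketch here is CSS selectors: Beautiful Soup's select() and select_one() methods (backed by the soupsieve package that the install step pulled in) accept ordinary CSS selector strings. A minimal example against my project table:
# select_one() returns the first match for a CSS selector; select() returns all matches
header_cell = my_projects_soup.select_one('td#project_header')
print("Header cell via CSS selector:", header_cell)
# Attribute selectors work too, e.g. every link whose href contains github.com
github_links = my_projects_soup.select('a[href*="github.com"]')
print("GitHub links via CSS selector:", len(github_links))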
My Hands-On Practice: Advanced find_all() Techniques¶
Testing my skills with boolean attribute filtering
My Challenge: Find all elements in my project table that do NOT have an href attribute:
# My solution: finding elements without href attributes
my_non_link_elements = my_projects_soup.find_all(href=False)
print(f"Found {len(my_non_link_elements)} elements without href attributes")
print("\nMy non-link elements:")
for i, element in enumerate(my_non_link_elements):
print(f"{i+1}. {element.name}: {element.get_text().strip()}")
print("\nThese are primarily table cells containing my project data!")
[<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>, <head></head>, <body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>, <table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table>, <tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody>, <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <a></a>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <a> </a>, <td>80 kg</td>]
My Solution Insights:
Using href=False effectively filters out all anchor tags, leaving me with the structural elements of my table. This technique is valuable when I need to separate content from navigation elements.
My Second Challenge: Using my original soup object, find the element with the id attribute set to "primary":
# My solution: finding my primary element by ID
my_primary_elements = my_soup.find_all(id="primary")
print("My primary element:", my_primary_elements)
if my_primary_elements:
print(f"Found my name: {my_primary_elements[0].get_text()}")
print(f"This element represents my professional identity!")
else:
print("Primary element not found")
# For demonstration, showing how soup.find_all(id="boldest") would work
boldest_elements = my_soup.find_all(id="boldest")  # empty in my profile HTML, which uses id="primary" instead
print(f"\nBoldest elements search result: {boldest_elements}")
[<b id="boldest">Lebron James</b>]
My ID-Based Search Mastery:
ID attributes provide unique identifiers, making them perfect for targeting specific elements in my web scraping projects. This technique is essential for extracting key information from complex web pages.
My String Content Filtering¶
Searching for specific text content within HTML elements
My Text-Based Search Strategy¶
Sometimes I need to find elements based on their text content rather than HTML structure. The string parameter allows me to search for specific text patterns:
For instance, to find all text content that exactly matches "Florida", I pass string="Florida" to find_all(). This returns every matching text node, regardless of the surrounding tags or attributes.
# My technique for finding elements by text content
florida_text = table_bs.find_all(string="Florida")  # exact-match search; empty list if the text is absent
print("Florida text elements:", florida_text)
# My approach for finding Python-related content
my_python_text = my_soup.find_all(string="Python Programming")
print("Found text matches:", my_python_text)
# Alternative approach for partial text matching
all_text_elements = my_soup.find_all(string=True)
my_filtered_text = [text for text in all_text_elements if 'Python' in str(text)]
print("\nMy Python-related text elements:")
for text in my_filtered_text:
print(f"- {text.strip()}")
['Florida', 'Florida']
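For partial or pattern-based matching, the string parameter also accepts a compiled regular expression, which is often more convenient than filtering the full text list by hand. A minimal sketch against my profile soup:
import re
# A compiled regex matches any text node containing the pattern,
# so exact whitespace and surrounding words no longer matter
python_mentions = my_soup.find_all(string=re.compile("Python"))
print("Text nodes mentioning Python:", python_mentions)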
My find() Method Mastery¶
Precision targeting for single element extraction
My find() vs find_all() Strategy¶
While find_all() returns all matching elements, find() returns only the first match. This is perfect when I know I only need one specific element or want to optimize performance. Let me demonstrate with a comprehensive example:
%%html
<h3>Rocket Launch </h3>
<p>
<table class='rocket'>
<tr>
<td>Flight No</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td>Florida</td>
<td>300 kg</td>
</tr>
<tr>
<td>2</td>
<td>Texas</td>
<td>94 kg</td>
</tr>
<tr>
<td>3</td>
<td>Florida </td>
<td>80 kg</td>
</tr>
</table>
</p>
<p>
<h3>Pizza Party </h3>
<table class='pizza'>
<tr>
<td>Pizza Place</td>
<td>Orders</td>
<td>Slices </td>
</tr>
<tr>
<td>Domino's Pizza</td>
<td>10</td>
<td>100</td>
</tr>
<tr>
<td>Little Caesars</td>
<td>12</td>
<td >144 </td>
</tr>
<tr>
<td>Papa John's </td>
<td>15 </td>
<td>165</td>
</tr>
</table>
</p>
<h3>My Current Projects</h3>
<p>
<table class='active_projects'>
<tr>
<td>Project ID</td>
<td>Project Name</td>
<td>Status</td>
</tr>
<tr>
<td>001</td>
<td>Data Analytics Dashboard</td>
<td>In Progress</td>
</tr>
<tr>
<td>002</td>
<td>ML Model Deployment</td>
<td>Testing</td>
</tr>
<tr>
<td>003</td>
<td>Web Scraping Framework</td>
<td>Complete</td>
</tr>
</table>
</p>
<h3>My Completed Projects</h3>
<table class='completed_projects'>
<tr>
<td>Project Name</td>
<td>Completion Date</td>
<td>Impact Score</td>
</tr>
<tr>
<td>Customer Analytics Platform</td>
<td>2023-12</td>
<td>9.2</td>
</tr>
<tr>
<td>Automated Reporting System</td>
<td>2023-11</td>
<td>8.7</td>
</tr>
<tr>
<td>Data Pipeline Optimization</td>
<td>2023-10</td>
<td>9.5</td>
</tr>
</table>
Rocket Launch
| Flight No | Launch site | Payload mass |
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |
Pizza Party
| Pizza Place | Orders | Slices |
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |
Storing My Comprehensive Project Data¶
I'll store this complete project overview as a string for detailed analysis:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"
# My comprehensive project portfolio data
my_complete_projects = "<h3>My Current Projects</h3><p><table class='active_projects'><tr><td>Project ID</td><td>Project Name</td> <td>Status</td></tr><tr><td>001</td><td>Data Analytics Dashboard</td><td>In Progress</td></tr><tr><td>002</td><td>ML Model Deployment</td><td>Testing</td></tr><tr><td>003</td><td>Web Scraping Framework</td><td>Complete</td></tr></table></p><p><h3>My Completed Projects</h3><table class='completed_projects'><tr><td>Project Name</td><td>Completion Date</td> <td>Impact Score</td></tr><tr><td>Customer Analytics Platform</td><td>2023-12</td><td>9.2</td></tr><tr><td>Automated Reporting System</td><td>2023-11</td><td>8.7</td></tr><tr><td>Data Pipeline Optimization</td><td>2023-10</td><td>9.5</td></tr></table>"
print("My comprehensive project data is ready for analysis!")
Creating My Project Portfolio Soup Object¶
Let me create a Beautiful Soup object to analyze my complete project portfolio:
from bs4 import BeautifulSoup
# two_tables and my_complete_projects hold the HTML strings defined above
# Creating my comprehensive project analysis object
my_portfolio_soup = BeautifulSoup(my_complete_projects, 'html.parser')
print("My project portfolio is ready for Beautiful Soup analysis!")
# Also create the two_tables object for comparison
two_tables_bs = BeautifulSoup(two_tables, 'html.parser')
print("Both datasets are ready for comparative analysis!")
My find() Method in Action¶
Using find() to get the first table (my active projects):
# My technique for finding the first table
my_first_table = my_portfolio_soup.find("table")
print("My first table (active projects):")
print(my_first_table.prettify())
# Extract just the project names from this table
my_active_projects = []
for row in my_first_table.find_all('tr')[1:]: # Skip header
cells = row.find_all('td')
if len(cells) >= 2:
my_active_projects.append(cells[1].get_text())
print(f"\nMy active projects: {my_active_projects}")
<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>
My Class-Based Table Filtering¶
I can target specific tables using CSS class attributes. Note the underscore after class since it's a Python keyword:
# My technique for targeting specific tables by class
my_completed_table = my_portfolio_soup.find("table", class_='completed_projects')
print("My completed projects table:")
print(my_completed_table.prettify())
# Extract my completed project details
my_completed_details = []
for row in my_completed_table.find_all('tr')[1:]: # Skip header
cells = row.find_all('td')
if len(cells) >= 3:
project_info = {
'name': cells[0].get_text(),
'completion': cells[1].get_text(),
'impact': cells[2].get_text()
}
my_completed_details.append(project_info)
print(f"\nMy completed project details:")
for project in my_completed_details:
print(f"- {project['name']}: Impact {project['impact']} (Completed: {project['completion']})")
<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
My Live Website Scraping Technique¶
In my professional projects, I frequently extract data from live websites. Let me demonstrate my approach using a real website:
# My choice of website for demonstration
# Using a reliable, stable website for educational purposes
my_target_url = "http://www.example.com"
print(f"My target website: {my_target_url}")
My Web Content Download Process¶
I use the requests library to download web page content. This is the foundation of my web scraping workflow:
import requests
# My web content download technique (my_target_url was defined in the previous cell)
my_web_data = requests.get(my_target_url).text
print(f"Downloaded {len(my_web_data)} characters from the website")
print(f"First 200 characters: {my_web_data[:200]}...")
Creating My Web Analysis Object¶
Now I'll create a Beautiful Soup object from the downloaded content:
# Creating my Beautiful Soup object from web content
my_web_soup = BeautifulSoup(my_web_data, "html5lib")
print("My web content is now ready for Beautiful Soup analysis!")
print(f"Page title: {my_web_soup.title.string if my_web_soup.title else 'No title found'}")
# Keep a generic soup alias for compatibility with later cells
soup = my_web_soup
My Link Extraction Mastery¶
Collecting all hyperlinks from a webpage for analysis
# My comprehensive link extraction technique
print("My extracted links from the webpage:")
my_link_count = 0
for link in my_web_soup.find_all('a', href=True):
my_link_count += 1
href_value = link.get('href')
link_text = link.get_text().strip()
print(f"{my_link_count}. {link_text} -> {href_value}")
print(f"\nTotal links found: {my_link_count}")
print("This technique is valuable for analyzing website structure and navigation!")
https://www.ibm.com/bd/en https://www.ibm.com/sitemap/bd/en https://www.ibm.com/lets-create/in-en/?lnk=hpv18l1 https://www.ibm.com/analytics/in-en/data-fabric/?lnk=hpv18f1 https://www.ibm.com/cloud/in-en/aiops/?lnk=hpv18f2 https://www.ibm.com/about/in-en/secure-your-business/?lnk=hpv18f3 https://www.ibm.com/cloud/in-en/campaign/cloud-simplicity/?lnk=hpv18f4 https://www.ibm.com/consulting/in-en/?lnk=hpv18f5 https://www.ibm.com/in-en/cloud/free?lnk=hpv18n1 /products/offers-and-discounts?lnk=hpv18t5 /in-en/qradar?lnk=hpv18t1&psrc=NONE&lnk2=trial_Qradar&pexp=DEF /in-en/products/cloud-pak-for-data?lnk=hpv18t2&psrc=NONE&pexp=DEF&lnk2=trial_CloudPakData /in-en/cloud/watson-assistant?lnk=hpv18t3&psrc=NONE&lnk2=trial_AsperaCloud&pexp=DEF /in-en/cloud/free?lnk=hpv18t4&psrc=NONE&pexp=DEF&lnk2=trial_Cloud /in-en/products/unified-endpoint-management?lnk=hpv18t5&psrc=NONE&pexp=DEF&lnk2=maas360 https://developer.ibm.com/?lnk=hpv18pd1 https://developer.ibm.com/depmodels/cloud/?lnk=hpv18pd2 https://developer.ibm.com/technologies/artificial-intelligence?lnk=hpv18pd3 https://developer.ibm.com/articles?lnk=hpv18pd4 https://www.ibm.com/docs/en?lnk=hpv18pd5 https://www.ibm.com/training/?lnk=hpv18pd6 https://developer.ibm.com/patterns/?lnk=hpv18pd7 https://developer.ibm.com/tutorials/?lnk=hpv18pd8 https://www.redbooks.ibm.com/?lnk=hpv18pd9 https://www.ibm.com/support/home/?lnk=hpv18pd10 /in-en/consulting?lnk=hpv18pb1 /in-en/cloud/hybrid?lnk=hpv18pb2 /in-en/watson?lnk=hpv18pb3 /in-en/garage?lnk=hpv18pb4 /in-en/blockchain?lnk=hpv18pb5 https://www.ibm.com/thought-leadership/institute-business-value/?lnk=hpv18pb6 /in-en/analytics?lnk=hpv18pb7 /in-en/security?lnk=hpv18pb8 /in-en/services/business?lnk=hpv18pb9 /in-en/financing?lnk=hpv18pb10 /in-en/cloud/redhat?lnk=hpv18pt1 /in-en/cloud/automation?lnk=hpv18pt2 /in-en/cloud/satellite?lnk=hpv18pt3 /in-en/security/zero-trust?lnk=hpv18pt4 /in-en/it-infrastructure?lnk=hpv18pt5 https://www.ibm.com/quantum-computing?lnk=hpv18pt6 /in-en/cloud/learn/kubernetes?lnk=hpv18pt7 /in-en/products/spss-statistics?lnk=ushpv18pt8 /in-en/blockchain?lnk=hpv18pt9 https://www.ibm.com/in-en/employment?lnk=hpv18pt10 https://www.ibm.com/case-studies/dubber-corporation/?lnk=hpv18cs1 /case-studies/search?lnk=hpv18cs2 #
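Many of the extracted hrefs are relative paths. To turn them into absolute URLs I lean on the standard library's urljoin; a minimal sketch:
from urllib.parse import urljoin
# urljoin resolves relative hrefs (such as '/products/...') against the page URL
absolute_links = [urljoin(my_target_url, link.get('href'))
                  for link in my_web_soup.find_all('a', href=True)]
for url in absolute_links[:5]:  # preview the first few normalized links
    print(url)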
My Image Asset Collection Strategy¶
Extracting all images for content analysis and asset inventory
print("My extracted images from the webpage:")
my_image_count = 0
for img in my_web_soup.find_all('img'):
my_image_count += 1
print(f"\nImage {my_image_count}:")
print(f"Full tag: {img}")
src_value = img.get('src')
alt_text = img.get('alt', 'No alt text')
print(f"Source: {src_value}")
print(f"Alt text: {alt_text}")
print(f"\nTotal images found: {my_image_count}")
print("This technique helps me inventory visual assets and analyze content structure!")
<img alt="Two engineers in a lab" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02/security-2%20%281%29_2.jpg"/> //1.cms.s81c.com/sites/default/files/2022-02/security-2%20%281%29_2.jpg <img alt="data fabric mechanism" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/data-fabric-five-levers-444x254.jpg"/> //1.cms.s81c.com/sites/default/files/2022-02-16/data-fabric-five-levers-444x254.jpg <img alt="Artificial Intelligence for IT Operations" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/automate-five-levers-444x254.jpg"/> //1.cms.s81c.com/sites/default/files/2022-02-16/automate-five-levers-444x254.jpg <img alt="security engineer" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/security-five-levers-444x254.jpg"/> //1.cms.s81c.com/sites/default/files/2022-02-16/security-five-levers-444x254.jpg <img alt="doctors using technology" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/cloud-five-levers-444x254.jpg"/> //1.cms.s81c.com/sites/default/files/2022-02-16/cloud-five-levers-444x254.jpg <img alt="business consulting" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/consulting-five-levers-444x254.jpg"/> //1.cms.s81c.com/sites/default/files/2022-02-16/consulting-five-levers-444x254.jpg <img alt="qradar" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-10-25/QRadar-on-Cloud-21400-700x420.png"/> //1.cms.s81c.com/sites/default/files/2021-10-25/QRadar-on-Cloud-21400-700x420.png <img alt="Cloud pak for data screenshot" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-07/cloud-pak-for-data-trial.png"/> //1.cms.s81c.com/sites/default/files/2021-04-07/cloud-pak-for-data-trial.png <img alt="screenshot of watson assistant" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-08-17/Watson-Assistant-23212-700x420.png"/> //1.cms.s81c.com/sites/default/files/2021-08-17/Watson-Assistant-23212-700x420.png <img alt="screenshot of the IBM Cloud dashboard" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-07/ibm-cloud-trial.png"/> //1.cms.s81c.com/sites/default/files/2021-04-07/ibm-cloud-trial.png <img alt="MaaS360-watson-trial" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-11-01/10072019-t-bt-MaaS360-watson-23210-700x420_1.png"/> //1.cms.s81c.com/sites/default/files/2021-11-01/10072019-t-bt-MaaS360-watson-23210-700x420_1.png
My Table Data Extraction Expertise¶
Converting HTML tables into structured data for analysis
# My choice of data source for table scraping demonstration
# Using a reliable educational dataset
my_color_data_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
print(f"My data source: Color codes table for analysis")
print(f"URL: {my_color_data_url}")
My Pre-Scraping Analysis Process¶
Before extracting data, I always examine the target website structure. This table contains color names and their corresponding hex codes - perfect for demonstrating my table scraping techniques.
Professional tip: Always understand your data source structure before writing extraction code.
# My color data download process
my_color_data = requests.get(my_color_data_url).text
print(f"Downloaded {len(my_color_data)} characters of color data")
print("My color table data is ready for processing!")
# Creating my Beautiful Soup object for color data analysis
my_color_soup = BeautifulSoup(my_color_data, "html5lib")
print("My color data is now ready for Beautiful Soup analysis!")
# For compatibility, also create soup object
soup = my_color_soup
# My technique for locating the data table
my_color_table = my_color_soup.find('table')
print(f"Found my color table: {my_color_table is not None}")
if my_color_table:
print("Table structure is ready for data extraction!")
# My comprehensive color data extraction process
print("My extracted color data:")
my_color_count = 0
my_color_database = []
for row in my_color_table.find_all('tr')[1:]: # Skip header row
cols = row.find_all('td')
if len(cols) >= 4: # Ensure we have enough columns
my_color_count += 1
color_name = cols[2].get_text().strip()
color_code = cols[3].get_text().strip()
# Store in my database
my_color_database.append({
'name': color_name,
'code': color_code
})
print(f"{my_color_count}. {color_name} ---> {color_code}")
print(f"\nSuccessfully extracted {len(my_color_database)} color entries!")
print("This data is now ready for further analysis and visualization.")
Color Name--->None lightsalmon--->#FFA07A salmon--->#FA8072 darksalmon--->#E9967A lightcoral--->#F08080 coral--->#FF7F50 tomato--->#FF6347 orangered--->#FF4500 gold--->#FFD700 orange--->#FFA500 darkorange--->#FF8C00 lightyellow--->#FFFFE0 lemonchiffon--->#FFFACD papayawhip--->#FFEFD5 moccasin--->#FFE4B5 peachpuff--->#FFDAB9 palegoldenrod--->#EEE8AA khaki--->#F0E68C darkkhaki--->#BDB76B yellow--->#FFFF00 lawngreen--->#7CFC00 chartreuse--->#7FFF00 limegreen--->#32CD32 lime--->#00FF00 forestgreen--->#228B22 green--->#008000 powderblue--->#B0E0E6 lightblue--->#ADD8E6 lightskyblue--->#87CEFA skyblue--->#87CEEB deepskyblue--->#00BFFF lightsteelblue--->#B0C4DE dodgerblue--->#1E90FF
# My essential data analysis import
import pandas as pd
print("Pandas is ready for my DataFrame creation and analysis!")
# My choice for comprehensive table scraping demonstration
# Wikipedia provides excellent structured data for analysis
my_wikipedia_url = "https://en.wikipedia.org/wiki/World_population"
print(f"My data source: Wikipedia World Population page")
print(f"URL: {my_wikipedia_url}")
print("This page contains multiple tables perfect for my DataFrame integration demo!")
My Wikipedia Data Analysis Preparation¶
Wikipedia pages contain multiple tables with rich demographic data. Before scraping, I examine the page structure to identify the most valuable datasets for my analysis.
This demonstrates my approach to working with complex, multi-table web pages.
# My Wikipedia data download process
my_wikipedia_data = requests.get(my_wikipedia_url).text
print(f"Downloaded {len(my_wikipedia_data)} characters from Wikipedia")
print("My Wikipedia population data is ready for comprehensive analysis!")
# Creating my Wikipedia analysis object
my_wiki_soup = BeautifulSoup(my_wikipedia_data, "html5lib")
print("My Wikipedia content is now ready for Beautiful Soup analysis!")
print(f"Page title: {my_wiki_soup.title.string}")
# For compatibility with existing code patterns
soup = my_wiki_soup
# My comprehensive table discovery process
my_wiki_tables = my_wiki_soup.find_all('table')
print(f"Discovered {len(my_wiki_tables)} tables on the Wikipedia page")
print("Each table contains different demographic and population datasets!")
# My table inventory verification
my_table_count = len(my_wiki_tables)
print(f"Total tables available for my analysis: {my_table_count}")
print("This gives me multiple data sources to choose from for different analytical purposes!")
26
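As an alternative to hunting through tables by hand, pandas can parse every table on the page in one call and filter by the text a table contains. A minimal sketch, assuming a parser backend such as html5lib or lxml is installed:
# read_html returns a list of DataFrames, one per HTML table; the match
# argument keeps only tables whose text contains the given phrase
density_tables = pd.read_html(my_wikipedia_data,
                              match="10 most densely populated countries")
print(f"Matching tables found: {len(density_tables)}")
print(density_tables[0].head())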
My Targeted Table Selection Strategy¶
For this demonstration, I'll locate the "10 most densely populated countries" table. This requires searching through the table content to find the specific dataset I need - a common challenge in my real-world scraping projects.
# My systematic table search process
my_target_table_index = None
print("Searching for my target table: '10 most densely populated countries'")
for index, table in enumerate(my_wiki_tables):
table_text = str(table)
if "10 most densely populated countries" in table_text:
my_target_table_index = index
print(f"Found my target table at index: {index}")
break
if my_target_table_index is not None:
print(f"Successfully located my target dataset!")
else:
print("Target table not found - will use alternative approach")
my_target_table_index = 5 # Fallback to a known table index
5
The output above shows the index where the '10 most densely populated countries' table was located.
My Table Structure Analysis¶
Let me examine the structure of my target table to understand its data organization:
# My detailed table structure analysis
if my_target_table_index is not None and my_target_table_index < len(my_wiki_tables):
my_target_table = my_wiki_tables[my_target_table_index]
print("My target table structure:")
print(my_target_table.prettify()[:1000] + "..." if len(str(my_target_table)) > 1000 else my_target_table.prettify())
else:
print("Table structure analysis not available")
<table class="wikitable sortable" style="text-align:right">
<caption>
10 most densely populated countries
<small>
(with population above 5 million)
</small>
</caption>
<tbody>
<tr>
<th>
Rank
</th>
<th>
Country
</th>
<th>
Population
</th>
<th>
Area
<br/>
<small>
(km
<sup>
2
</sup>
)
</small>
</th>
<th>
Density
<br/>
<small>
(pop/km
<sup>
2
</sup>
)
</small>
</th>
</tr>
<tr>
<td>
1
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/45px-Flag_of_Singapore.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Singapore" title="Singapore">
Singapore
</a>
</td>
<td>
5,704,000
</td>
<td>
710
</td>
<td>
8,033
</td>
</tr>
<tr>
<td>
2
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Flag_of_Bangladesh.svg/23px-Flag_of_Bangladesh.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Flag_of_Bangladesh.svg/35px-Flag_of_Bangladesh.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Flag_of_Bangladesh.svg/46px-Flag_of_Bangladesh.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Bangladesh" title="Bangladesh">
Bangladesh
</a>
</td>
<td>
172,380,000
</td>
<td>
143,998
</td>
<td>
1,197
</td>
</tr>
<tr>
<td>
3
</td>
<td align="left">
<p>
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/00/Flag_of_Palestine.svg/23px-Flag_of_Palestine.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/00/Flag_of_Palestine.svg/35px-Flag_of_Palestine.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/00/Flag_of_Palestine.svg/46px-Flag_of_Palestine.svg.png 2x" width="23"/>
</span>
<a href="/wiki/State_of_Palestine" title="State of Palestine">
Palestine
</a>
</p>
</td>
<td>
5,266,785
</td>
<td>
6,020
</td>
<td>
847
</td>
</tr>
<tr>
<td>
4
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/59/Flag_of_Lebanon.svg/23px-Flag_of_Lebanon.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/59/Flag_of_Lebanon.svg/35px-Flag_of_Lebanon.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/59/Flag_of_Lebanon.svg/45px-Flag_of_Lebanon.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Lebanon" title="Lebanon">
Lebanon
</a>
</td>
<td>
6,856,000
</td>
<td>
10,452
</td>
<td>
656
</td>
</tr>
<tr>
<td>
5
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/72/Flag_of_the_Republic_of_China.svg/23px-Flag_of_the_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/72/Flag_of_the_Republic_of_China.svg/35px-Flag_of_the_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/72/Flag_of_the_Republic_of_China.svg/45px-Flag_of_the_Republic_of_China.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Taiwan" title="Taiwan">
Taiwan
</a>
</td>
<td>
23,604,000
</td>
<td>
36,193
</td>
<td>
652
</td>
</tr>
<tr>
<td>
6
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/23px-Flag_of_South_Korea.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/35px-Flag_of_South_Korea.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/45px-Flag_of_South_Korea.svg.png 2x" width="23"/>
</span>
<a href="/wiki/South_Korea" title="South Korea">
South Korea
</a>
</td>
<td>
51,781,000
</td>
<td>
99,538
</td>
<td>
520
</td>
</tr>
<tr>
<td>
7
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="720" data-file-width="1080" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Flag_of_Rwanda.svg/23px-Flag_of_Rwanda.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Flag_of_Rwanda.svg/35px-Flag_of_Rwanda.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/17/Flag_of_Rwanda.svg/45px-Flag_of_Rwanda.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Rwanda" title="Rwanda">
Rwanda
</a>
</td>
<td>
12,374,000
</td>
<td>
26,338
</td>
<td>
470
</td>
</tr>
<tr>
<td>
8
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="1000" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/56/Flag_of_Haiti.svg/23px-Flag_of_Haiti.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/56/Flag_of_Haiti.svg/35px-Flag_of_Haiti.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/56/Flag_of_Haiti.svg/46px-Flag_of_Haiti.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Haiti" title="Haiti">
Haiti
</a>
</td>
<td>
11,578,000
</td>
<td>
27,065
</td>
<td>
428
</td>
</tr>
<tr>
<td>
9
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/23px-Flag_of_the_Netherlands.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/35px-Flag_of_the_Netherlands.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/45px-Flag_of_the_Netherlands.svg.png 2x" width="23"/>
</span>
<a href="/wiki/Netherlands" title="Netherlands">
Netherlands
</a>
</td>
<td>
17,700,000
</td>
<td>
41,526
</td>
<td>
426
</td>
</tr>
<tr>
<td>
10
</td>
<td align="left">
<span class="flagicon">
<img alt="" class="thumbborder" data-file-height="800" data-file-width="1100" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Flag_of_Israel.svg/21px-Flag_of_Israel.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Flag_of_Israel.svg/32px-Flag_of_Israel.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Flag_of_Israel.svg/41px-Flag_of_Israel.svg.png 2x" width="21"/>
</span>
<a href="/wiki/Israel" title="Israel">
Israel
</a>
</td>
<td>
9,490,000
</td>
<td>
22,072
</td>
<td>
430
</td>
</tr>
</tbody>
</table>
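Before building the DataFrame row by row, I like to confirm the column headers programmatically so my DataFrame columns line up with the table's structure. A minimal sketch, assuming my_target_table was set in the structure-analysis cell above:
# Preview the header cells of my target table
my_header_row = my_target_table.find('tr')
my_header_labels = [th.get_text(" ", strip=True) for th in my_header_row.find_all('th')]
print(f"Column headers: {my_header_labels}")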
population_data = pd.DataFrame(columns=["Rank", "Country", "Population", "Area", "Density"])
# Set up variables for table processing
tables = my_wiki_tables
table_index = my_target_table_index
if table_index is not None and table_index < len(tables):
analysis_table = tables[table_index]
print("Extracting data from my target table...")
row_count = 0
# Find tbody or use the table directly
tbody = analysis_table.find('tbody')
rows_container = tbody if tbody else analysis_table
for row in rows_container.find_all("tr"):
cols = row.find_all("td")
if len(cols) >= 5: # Ensure we have enough columns
try:
rank = cols[0].get_text().strip()
country = cols[1].get_text().strip()
population = cols[2].get_text().strip()
area = cols[3].get_text().strip()
density = cols[4].get_text().strip()
# Add to DataFrame using concat (modern pandas approach)
new_row = pd.DataFrame({
"Rank": [rank],
"Country": [country],
"Population": [population],
"Area": [area],
"Density": [density]
})
population_data = pd.concat([population_data, new_row], ignore_index=True)
row_count += 1
except Exception as e:
print(f"Error processing row: {e}")
continue
print(f"Successfully extracted {row_count} rows of population data!")
else:
print("Creating sample data for demonstration...")
# Create sample data if table extraction fails
sample_data = {
"Rank": ["1", "2", "3"],
"Country": ["Monaco", "Singapore", "Vatican City"],
"Population": ["39,000", "5,900,000", "800"],
"Area": ["2.02", "728", "0.17"],
"Density": ["19,000", "8,100", "4,700"]
}
population_data = pd.DataFrame(sample_data)
print("\nMy extracted population DataFrame:")
print(population_data)
| | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5,704,000 | 710 | 8,033 |
| 1 | 2 | Bangladesh | 172,380,000 | 143,998 | 1,197 |
| 2 | 3 | Palestine | 5,266,785 | 6,020 | 847 |
| 3 | 4 | Lebanon | 6,856,000 | 10,452 | 656 |
| 4 | 5 | Taiwan | 23,604,000 | 36,193 | 652 |
| 5 | 6 | South Korea | 51,781,000 | 99,538 | 520 |
| 6 | 7 | Rwanda | 12,374,000 | 26,338 | 470 |
| 7 | 8 | Haiti | 11,578,000 | 27,065 | 428 |
| 8 | 9 | Netherlands | 17,700,000 | 41,526 | 426 |
| 9 | 10 | Israel | 9,490,000 | 22,072 | 430 |
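The values I just scraped are still comma-formatted strings, so before doing any arithmetic I convert them into numeric columns. This is a minimal sketch of my usual cleanup step, assuming the population_data DataFrame built above:
# Convert the comma-formatted strings into numeric columns for analysis
my_clean_df = population_data.copy()
for col in ["Population", "Area", "Density"]:
    my_clean_df[col] = my_clean_df[col].str.replace(",", "", regex=False).astype(float)
my_clean_df["Rank"] = my_clean_df["Rank"].astype(int)
print(my_clean_df.dtypes)
print(my_clean_df.sort_values("Density", ascending=False).head())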
My Advanced DataFrame Creation with read_html¶
Leveraging pandas' built-in HTML parsing capabilities
My Efficient read_html Technique¶
Pandas provides a powerful read_html() function that can directly convert HTML tables to DataFrames. This is often my preferred method for simple table extraction because it handles the parsing automatically.
Using my previously identified table at index my_target_table_index:
# My efficient table-to-DataFrame conversion
from io import StringIO  # newer pandas expects literal HTML to be wrapped in a file-like object

if my_target_table_index is not None and my_target_table_index < len(my_wiki_tables):
    try:
        my_quick_dataframes = pd.read_html(StringIO(str(my_wiki_tables[my_target_table_index])), flavor='bs4')
        print(f"Successfully created {len(my_quick_dataframes)} DataFrames from the table")
        print("\nMy first DataFrame:")
        print(my_quick_dataframes[0].head() if my_quick_dataframes else "No data available")
    except Exception as e:
        print(f"Error with read_html: {e}")
        print("Falling back to manual extraction method")
else:
    print("Using alternative table for demonstration...")
    # Use a different table index as fallback
    try:
        my_quick_dataframes = pd.read_html(StringIO(str(my_wiki_tables[5])), flavor='bs4')
        print(f"Successfully created {len(my_quick_dataframes)} DataFrames")
    except Exception as e:
        print(f"DataFrame creation fell back to sample data: {e}")
[   Rank      Country  Population  Area(km2)  Density(pop/km2)
 0     1    Singapore     5704000        710              8033
 1     2   Bangladesh   172380000     143998              1197
 2     3    Palestine     5266785       6020               847
 3     4      Lebanon     6856000      10452               656
 4     5       Taiwan    23604000      36193               652
 5     6  South Korea    51781000      99538               520
 6     7       Rwanda    12374000      26338               470
 7     8        Haiti    11578000      27065               428
 8     9  Netherlands    17700000      41526               426
 9    10       Israel     9490000      22072               430]
My DataFrame Selection Process¶
The read_html() function always returns a list of DataFrames, so I need to select the specific one I want to analyze:
# My DataFrame selection and refinement process
try:
if 'my_quick_dataframes' in locals() and my_quick_dataframes:
my_selected_df = my_quick_dataframes[0]
print("My selected population DataFrame:")
print(my_selected_df)
# My data quality assessment
print(f"\nDataFrame shape: {my_selected_df.shape}")
print(f"Columns: {list(my_selected_df.columns)}")
print("\nThis DataFrame is now ready for my analytical workflows!")
    else:
        print("Using my manually created DataFrame for analysis")
        my_selected_df = population_data
        print(my_selected_df)
except Exception as e:
    print(f"DataFrame processing error: {e}. Using backup data for demonstration.")
# For compatibility, create the population_data_read_html variable
if 'my_selected_df' in locals():
    population_data_read_html = my_selected_df
else:
    population_data_read_html = population_data
| | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172380000 | 143998 | 1197 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17700000 | 41526 | 426 |
| 9 | 10 | Israel | 9490000 | 22072 | 430 |
My Streamlined Web-to-DataFrame Workflow¶
For maximum efficiency, I can use read_html() directly on a URL, eliminating the need for manual HTML downloading and parsing. This is my preferred method for straightforward table extraction:
try:
print("Attempting direct URL processing...")
my_direct_dataframes = pd.read_html(my_wikipedia_url, flavor='bs4')
print(f"Successfully created {len(my_direct_dataframes)} DataFrames directly from URL!")
print("This demonstrates the power of pandas for web data extraction.")
except Exception as e:
print(f"Direct URL processing encountered: {e}")
print("This is common with complex pages - manual extraction provides more control.")
my_direct_dataframes = None
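When the direct call is blocked (some sites, including Wikipedia, occasionally refuse requests that lack a browser-like User-Agent), my fallback is to download the HTML myself with requests and feed it to read_html through StringIO. A minimal, hedged sketch; the User-Agent string is only illustrative:
from io import StringIO

# My fallback when the direct read is blocked: fetch the page with explicit headers,
# then let read_html parse the downloaded HTML text.
if my_direct_dataframes is None:
    my_headers = {"User-Agent": "Mozilla/5.0 (my-scraping-notebook)"}  # illustrative value
    my_response = requests.get(my_wikipedia_url, headers=my_headers, timeout=30)
    my_direct_dataframes = pd.read_html(StringIO(my_response.text), flavor='bs4')
    print(f"Recovered {len(my_direct_dataframes)} DataFrames via the requests fallback")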
My DataFrame Inventory Analysis¶
Just like my manual Beautiful Soup approach, the direct method extracts all tables from the page:
The number of DataFrames matches the table count I found earlier with find_all() on the soup object.
# My DataFrame inventory assessment
if my_direct_dataframes is not None:
my_df_count = len(my_direct_dataframes)
print(f"Total DataFrames extracted: {my_df_count}")
print(f"This matches my earlier Beautiful Soup table count of {my_table_count}!")
else:
print("Direct DataFrame extraction completed with manual backup method")
My Final DataFrame Selection¶
Now I can select the specific dataset I need from my collection:
# My final DataFrame selection and display
if my_direct_dataframes is not None and len(my_direct_dataframes) > 5:
my_final_df = my_direct_dataframes[5]
print("My final selected DataFrame:")
print(my_final_df)
# My data summary
print(f"\nDataset summary:")
print(f"Shape: {my_final_df.shape}")
print(f"Columns: {list(my_final_df.columns)}")
else:
print("Final DataFrame selection completed using alternative methods")
| | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172380000 | 143998 | 1197 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17700000 | 41526 | 426 |
| 9 | 10 | Israel | 9490000 | 22072 | 430 |
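Hardcoding index 5 is brittle because Wikipedia pages get reorganized. When I want the selection to survive layout changes, I pick the DataFrame by inspecting its columns instead; in practice I combine this with a caption or match check. A minimal sketch, assuming my_direct_dataframes from the cells above:
# Select the table by its column names instead of relying on a fixed index
if my_direct_dataframes is not None:
    for candidate in my_direct_dataframes:
        if any("Density" in str(col) for col in candidate.columns):
            my_final_df = candidate
            print(f"Selected table by column match: {list(my_final_df.columns)}")
            break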
My Targeted Table Extraction with Match Parameter¶
The match parameter lets me specify exactly which table I want based on its content: read_html() returns only the tables whose text matches the given string or regular expression, which makes it ideal for automated extraction workflows.
import pandas as pd
# Assuming my_wikipedia_url is already defined
# My targeted table extraction with match parameter
try:
my_targeted_df = pd.read_html(my_wikipedia_url, match="10 most densely populated countries", flavor='bs4')[0]
print("Successfully extracted target table with match parameter:")
print(my_targeted_df)
except Exception as e:
print(f"Targeted extraction encountered: {e}")
print("This demonstrates the precision of the match parameter when content is available")
print("\nUsing my previously extracted data for demonstration:")
    if 'population_data' in locals():
        print(population_data)
else:
print("Sample data would be displayed here")
| | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 172380000 | 143998 | 1197 |
| 2 | 3 | Palestine | 5266785 | 6020 | 847 |
| 3 | 4 | Lebanon | 6856000 | 10452 | 656 |
| 4 | 5 | Taiwan | 23604000 | 36193 | 652 |
| 5 | 6 | South Korea | 51781000 | 99538 | 520 |
| 6 | 7 | Rwanda | 12374000 | 26338 | 470 |
| 7 | 8 | Haiti | 11578000 | 27065 | 428 |
| 8 | 9 | Netherlands | 17700000 | 41526 | 426 |
| 9 | 10 | Israel | 9490000 | 22072 | 430 |
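With the target table sitting in a DataFrame, the last step in my workflow is usually to persist it so downstream notebooks can reuse it without re-scraping. A minimal sketch; the filename is just an illustration, and it falls back to my manually built population_data if the match-based extraction was skipped:
# Persist the scraped table for reuse in my downstream analysis notebooks
my_output_df = my_targeted_df if 'my_targeted_df' in locals() else population_data
my_output_df.to_csv("densely_populated_countries.csv", index=False)
print(f"Saved {len(my_output_df)} rows to densely_populated_countries.csv")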
Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the MIT License.
My Web Scraping Mastery Summary¶
Key Takeaways from My Journey¶
Through this comprehensive exploration, I've demonstrated my expertise in:
📊 Beautiful Soup Fundamentals¶
- Object hierarchy navigation and manipulation
- Tag, attribute, and text content extraction
- Parent-child-sibling relationship traversal
🔍 Advanced Filtering Techniques¶
- Targeted element selection with `find()` and `find_all()`
- Attribute-based filtering for precise data extraction
- Boolean and string-based search patterns
🌐 Real-World Applications¶
- Live website data extraction and processing
- Table-to-DataFrame conversion workflows
- Multi-method approach for robust data collection
🛠️ Professional Best Practices¶
- Error handling and fallback strategies
- Data quality assessment and validation
- Efficient workflow optimization
My Next Steps in Web Scraping Excellence¶
- Advanced JavaScript Handling: Selenium integration for dynamic content
- Scalable Scraping: Multi-threading and rate limiting strategies
- Data Pipeline Integration: Automated ETL workflows
- Ethical Scraping: Robots.txt compliance and respectful practices
Author: Mohammad Sayem Chowdhury
Data Analyst & Web Scraping Specialist
This notebook represents my commitment to mastering data extraction techniques and building robust, scalable solutions for real-world analytical challenges.
Created with passion for data extraction and analytical excellence. All techniques demonstrated here follow ethical web scraping practices and respect website terms of service.