Parsing XML with BeautifulSoup in Python

Introduction

Extensible Markup Language (XML) is a popular markup language due to the way it structures data. It has found use in data transfer (representing serialized objects) and configuration files.

Despite JSON’s rising popularity, you can still find XML in Android development manifest files, Java/Maven build tools, and SOAP APIs on the web. Parsing XML is thus still a common task a developer will have to do.

In Python, we can read and parse XML by leveraging two libraries: BeautifulSoup and lxml.

In this guide, we’ll extract and parse data from XML files with BeautifulSoup and lxml, and store the results using Pandas.

Setting up LXML and BeautifulSoup

First, we need to install the two libraries. We’ll create a new folder in your workspace, set up a virtual environment, and install the libraries:

$ mkdir xml_parsing_tutorial
$ cd xml_parsing_tutorial
$ python3 -m venv env 
$ . env/bin/activate 
$ pip install lxml beautifulsoup4 
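
If you’re on Windows, the activation step looks slightly different (this assumes the standard venv layout and the cmd shell; PowerShell uses Activate.ps1 instead):

$ env\Scripts\activate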

Now that everything’s set up, let’s do some parsing!

Parsing XML with lxml and BeautifulSoup

Parsing always depends on the underlying file and the structure it uses, so there’s no single silver bullet for all files. BeautifulSoup parses the document automatically, but which elements matter depends on the task.

Hence, it’s best to learn parsing with a hands-on approach. Save the following XML in a file in your working directory – teachers.xml:

<?xml version="1.0" encoding="UTF-8"?>
<teachers>
    <teacher>
        <name>Sam Davies</name>
        <age>35</age>
        <subject>Maths</subject>
    </teacher>
    <teacher>
        <name>Cassie Stone</name>
        <age>24</age>
        <subject>Science</subject>
    </teacher>
    <teacher>
        <name>Derek Brandon</name>
        <age>32</age>
        <subject>History</subject>
    </teacher>
</teachers>

The <teachers> tag indicates the root of the XML document, and each <teacher> tag is a child (sub-element) of <teachers>, holding information about a single person. The <name>, <age>, and <subject> tags are children of a <teacher> tag, and grandchildren of the <teachers> tag.

The first line, <?xml version="1.0" encoding="UTF-8"?>, in the sample document above is called an XML prolog. It always comes at the beginning of an XML file, although including a prolog in an XML document is entirely optional.

The XML prologue shown above indicates the XML version used and the type of character encoding. In this case, the characters in the XML document are encoded in UTF-8.

Now that we understand the structure of the XML file, we can parse it. Create a new file named teachers.py in your working directory, and import the BeautifulSoup library:

from bs4 import BeautifulSoup

Note: As you may have noticed, we did not import lxml! When BeautifulSoup is used with the 'xml' parser, lxml is used under the hood, so importing it separately isn’t necessary. It isn’t installed as part of BeautifulSoup, though, which is why we installed it ourselves.
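
If you’d like to confirm that lxml is actually available, here’s a minimal sketch; bs4 raises FeatureNotFound when the requested parser is missing:

from bs4 import BeautifulSoup, FeatureNotFound

try:
    BeautifulSoup('<root/>', 'xml')  # the 'xml' feature delegates to lxml
except FeatureNotFound:
    print('lxml is missing, install it with: pip install lxml')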

We will now read the contents of the XML file we created and store it in a variable called soup so we can begin parsing:

# Read the file contents into a string
with open('teachers.xml', 'r') as f:
    file = f.read()

# Pass the contents and the XML parser of choice to BeautifulSoup
soup = BeautifulSoup(file, 'xml')

The soup variable now contains the parsed contents of our XML file. We can use this variable and the methods attached to it to retrieve the XML information with Python code.
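
For instance, individual tags can be reached with attribute-style access or find(). A minimal sketch, using the soup we just built (note that .name is reserved for a tag’s own name, so the <name> element needs find()):

# Attribute-style access returns the first matching descendant tag
print(soup.teacher.age.text)           # 35

# .name would return the string 'teacher', so use find() for the <name> tag
print(soup.teacher.find('name').text)  # Sam Davies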

Suppose we want to see only the names of the teachers from the XML document. We can get this information with a few lines of code:

names = soup.find_all('name')
for name in names:
    print(name.text)

Running python teachers.py will give us:

Sam Davies 
Cassie Stone 
Derek Brandon

The find_all() method returns a list of all the matching tags passed to it as an argument. As shown in the code above, soup.find_all('name') returns all the <name> tags in the XML file. We then iterate over these tags and print their text property, which holds the tags’ values.
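
Relatedly, find() returns just the first match (or None if there is none), and find_all() accepts extra filters such as a limit. A small sketch against the same soup:

print(soup.find('name').text)  # Sam Davies

# Stop after the first two matches
print([n.text for n in soup.find_all('name', limit=2)])
# ['Sam Davies', 'Cassie Stone']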

Viewing Parsed Data in a Table

Let’s take things one step further, parse all the contents of the XML file, and present them in tabular format.

Let’s rewrite the teachers.py file with:

from bs4 import BeautifulSoup

# Read the file contents
with open('teachers.xml', 'r') as f:
    file = f.read()

# Parse the XML with the lxml XML parser
soup = BeautifulSoup(file, 'xml')

# Extract each field across all the teachers
names = soup.find_all('name')
ages = soup.find_all('age')
subjects = soup.find_all('subject')

# Print a fixed-width table: 15-char name, 5-char age, 11-char subject columns
print('-'.center(35, '-'))
print('|' + 'Name'.center(15) + '|' + ' Age ' + '|' + 'Subject'.center(11) + '|')
for i in range(0, len(names)):
    print('-'.center(35, '-'))
    print(f'|{names[i].text.center(15)}|{ages[i].text.center(5)}|{subjects[i].text.center(11)}|')
print('-'.center(35, '-'))

The output of the code above will look like this:

-----------------------------------
|      Name     | Age |  Subject  |
-----------------------------------
|   Sam Davies  |  35 |   Maths   |
-----------------------------------
|  Cassie Stone |  24 |  Science  |
-----------------------------------
| Derek Brandon |  32 |  History  |
-----------------------------------
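
As a side note, iterating the three lists in parallel with zip() is a slightly tidier variant of the indexed loop above, with identical output:

for name, age, subject in zip(names, ages, subjects):
    print('-'.center(35, '-'))
    print(f'|{name.text.center(15)}|{age.text.center(5)}|{subject.text.center(11)}|')
print('-'.center(35, '-'))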

Congratulations! You’ve just parsed your first XML file with BeautifulSoup and lxml! Now that you’re more comfortable with the theory and the process, let’s try a more real-world example.

We formatted the data as a table as a precursor to storing it in a versatile data structure. That is, in the upcoming mini-project, we will store the data in a Pandas DataFrame.

If you are not yet familiar with DataFrames – read our Python with Pandas: Guide to DataFrames!
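
In short, a DataFrame is a two-dimensional, labeled data structure. A minimal sketch with made-up values:

import pandas as pd

df = pd.DataFrame({'name': ['Sam Davies', 'Cassie Stone'], 'age': [35, 24]})
print(df)
#            name  age
# 0    Sam Davies   35
# 1  Cassie Stone   24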

Parsing an RSS Feed and Storing the Data in a CSV

In this section, we will parse the RSS feed of The New York Times, and store its data in a CSV file.

RSS is an abbreviation of Really Simple Syndication. An RSS feed is a file that contains a summary of updates from a website, written in XML. In this case, the RSS feed of The New York Times contains a summary of daily news updates on their website. This summary contains links to news releases, links to article images, descriptions of news items, and more. RSS feeds are also used to allow people to get the data without scraping websites, as a nice token from publishers.

Here’s a snapshot of an RSS feed from the New York Times:

You can access various New York Times RSS feeds for continents, states, regions, topics, and other criteria through this link.

It is important to look at and understand the structure of the data before you can begin parsing it. The data we would like to extract from the RSS feed for each news article is (see the sketch of a typical <item> after the list):

  • Globally unique identifier (GUID)
  • Title
  • Publication date
  • Description
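
Each article in the feed is wrapped in an <item> tag holding those fields, roughly like this (the values below are illustrative placeholders, not real feed data):

<item>
    <title>Example headline</title>
    <guid isPermaLink="true">https://www.nytimes.com/...</guid>
    <pubDate>Sat, 01 Jan 2022 12:00:00 +0000</pubDate>
    <description>A short summary of the article.</description>
</item>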

Now that we know the structure and have clear goals, let’s get started! We’ll need the requests library and the pandas library to retrieve the data and easily convert it to a CSV file.

If you haven’t worked with requests before, read our Python Requests Module guide!

With requests, we can send HTTP requests to websites and parse the responses. In this case, we can use it to retrieve the RSS feed (in XML) so that BeautifulSoup can parse it. With pandas, we can format the parsed data as a table, and finally store the table’s contents in a CSV file.

In the same working directory, install requests and pandas (your virtual environment should still be active):

$ pip install requests pandas

In a new file, nyt_rss_feed.py, let’s import our libraries:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Next, let’s send an HTTP request to the New York Times server to receive their RSS feed and retrieve its contents:

url = 'https://rss.nytimes.com/services/xml/rss/nyt/US.xml'
xml_data = requests.get(url).content 

With the code above, we get a response to our HTTP request and store its contents in the xml_data variable. The requests library returns the data as bytes.
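
If you want the script to fail loudly on HTTP errors instead of feeding an error page to the parser, a slightly more defensive variant looks like this (raise_for_status() is a standard requests method):

response = requests.get(url)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
xml_data = response.content  # raw bytes, which BeautifulSoup accepts directly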

Now, create the following function to parse the XML data into a table in Pandas, using BeautifulSoup:

def parse_xml(xml_data):
    # Initialize soup with the XML data, using the lxml XML parser
    soup = BeautifulSoup(xml_data, 'xml')

    # Each news article in the feed lives in an <item> tag
    all_items = soup.find_all('item')
    items_length = len(all_items)

    rows = []
    for index, item in enumerate(all_items):
        guid = item.find('guid').text
        title = item.find('title').text
        pub_date = item.find('pubDate').text
        description = item.find('description').text

        # Collect the extracted values as one row per article
        rows.append({
            'guid': guid,
            'title': title,
            'pubDate': pub_date,
            'description': description
        })
        print(f'Appending row {index + 1} of {items_length}')

    # Build the DataFrame in one go; DataFrame.append was removed in pandas 2.0
    df = pd.DataFrame(rows, columns=['guid', 'title', 'pubDate', 'description'])
    return df

The function above parses the XML data from our HTTP request with BeautifulSoup, storing the parsed contents in a soup variable. It then collects one row per news item and builds a Pandas DataFrame with the columns we want, referenced through the df variable.

Next, we traverse the XML to find all the <item> tags. By iterating through each <item> tag, we are able to extract its child tags: <guid>, <title>, <pubDate>, and <description>. Notice how we use the find() method to retrieve only one object per tag. We add the values of each child tag to a row dictionary, and all the rows are combined into the Pandas table at the end.
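
Note that find() returns None when a tag is absent, so item.find('guid').text would raise an AttributeError on a malformed item. If you cannot guarantee that every <item> carries all four tags, a guarded lookup avoids this; a minimal sketch, assuming an empty string is an acceptable fallback:

guid_tag = item.find('guid')
guid = guid_tag.text if guid_tag is not None else ''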

Now, at the end of the file after the function, add these two lines of code to call the function and create a CSV file:

df = parse_xml(xml_data)
df.to_csv('news.csv')
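
By default, to_csv() also writes the DataFrame’s integer index as the first column. If you’d rather omit it, pass index=False, a standard pandas parameter:

df.to_csv('news.csv', index=False)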

Run python nyt_rss_feed.py to create a new CSV file in your current working directory:

Appending row 1 of 24
Appending row 2 of 24
...
Appending row 24 of 24

The contents of the CSV file will look like this:

Note: Downloading the data may take a moment depending on your internet connection and the RSS feed. Parsing the data may also take some time depending on your CPU and memory resources. The feed we used is fairly small, so it should process quickly. Please be patient if you don’t see results immediately.

Congratulations, you’ve successfully parsed the RSS feed from The New York Times and converted it to a CSV file!

Summary

In this tutorial, we learned how to set up BeautifulSoup and lxml to parse XML files. We first got some practice by parsing a simple XML file with teacher data, and then we parsed The New York Times’ RSS feed, converting its data to a CSV file.

You can use these techniques to parse any other XML you may encounter, and convert it to the different formats that you need!
