Extensible Markup Language (XML) is a popular markup language because of the way it structures data. It has found use in data transfer (representing serialized objects) and in configuration files.
Despite JSON’s rising popularity, you can still find XML in Android development manifest files, in Java/Maven build tools, and in SOAP APIs on the web. Parsing XML is therefore still a common task a developer will have to do.
In Python, we can read and parse XML by leveraging two libraries: BeautifulSoup and lxml.
In this guide, we will parse and extract data from XML files with BeautifulSoup and lxml, and store the results using Pandas.
Setting Up lxml and BeautifulSoup
First we need to install the two libraries. Create a new folder in your working environment, set up a virtual environment, and install the libraries:
```shell
mkdir xml_parsing_tutorial
cd xml_parsing_tutorial
python3 -m venv env
. env/bin/activate
pip install lxml beautifulsoup4
```
Now that everything’s set up, let’s do some parsing!
Parsing XML with lxml and BeautifulSoup
Parsing always depends on the underlying file and the structure it uses, so there is no single silver bullet for all files. BeautifulSoup parses a document automatically, but which elements matter depends on the task at hand.
Hence, it is best to learn parsing with a hands-on approach. Save the following XML in a file named teachers.xml in your working directory:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<teachers>
    <teacher>
        <name>Sam Davies</name>
        <age>35</age>
        <subject>Maths</subject>
    </teacher>
    <teacher>
        <name>Cassie Stone</name>
        <age>24</age>
        <subject>Science</subject>
    </teacher>
    <teacher>
        <name>Derek Brandon</name>
        <age>32</age>
        <subject>History</subject>
    </teacher>
</teachers>
```
The <teachers> tag indicates the root of the XML document. The <teacher> tag is a child or sub-element of <teachers></teachers>, holding information about a single person. The <name>, <age>, and <subject> tags are children of the <teacher> tag, and grandchildren of the <teachers> tag.
The first line, <?xml version="1.0" encoding="UTF-8"?>, in the sample document above is called an XML prologue. It always comes at the beginning of an XML file, although it is completely optional to include one in an XML document.
The XML prologue shown above indicates the XML version used and the type of character encoding. In this case, the characters in the XML document are encoded in UTF-8.
Now that we understand the structure of the XML file, we can parse it. Create a new file named teachers.py in your working directory, and import the BeautifulSoup library:
from bs4 import BeautifulSoup
Note: As you may have noticed, we did not import lxml! When BeautifulSoup is imported, lxml is integrated automatically, so importing it separately isn't necessary, though it is not installed as part of BeautifulSoup and must be installed on its own.
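For comparison, lxml can also be used directly through its etree API, without BeautifulSoup. A minimal sketch, using an inline sample that mirrors the teachers.xml structure so it runs standalone:

```python
from lxml import etree

# Inline sample mirroring the teachers.xml structure, so this runs standalone.
xml = b"""<teachers>
    <teacher><name>Sam Davies</name><age>35</age></teacher>
    <teacher><name>Cassie Stone</name><age>24</age></teacher>
</teachers>"""

root = etree.fromstring(xml)

# The XPath-style expression .//name matches every <name> element below the root.
names = [el.text for el in root.findall('.//name')]
print(names)  # ['Sam Davies', 'Cassie Stone']
```

In the rest of this guide, though, we will stick with BeautifulSoup's friendlier interface and let it drive lxml under the hood.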
We will now read the contents of the XML file we created and store it in a variable called soup so we can begin parsing:
```python
with open('teachers.xml', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')
```
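As an aside, BeautifulSoup also accepts an open file handle directly, so the intermediate read() call is optional. A small sketch (it writes a throwaway file first so it runs on its own; the file name teachers_sample.xml is arbitrary):

```python
from bs4 import BeautifulSoup

# Write a throwaway copy of the sample so this snippet runs on its own.
with open('teachers_sample.xml', 'w') as f:
    f.write('<teachers><teacher><name>Sam Davies</name></teacher></teachers>')

# BeautifulSoup accepts an open file handle as well as a string.
with open('teachers_sample.xml', 'r') as f:
    soup = BeautifulSoup(f, 'xml')

first_name = soup.find('name').text
print(first_name)  # Sam Davies
```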
The soup variable now holds the parsed contents of our XML file. We can use this variable and the methods attached to it to retrieve the XML information with Python code.
Suppose we want to see only the names of the teachers from the XML document. We can get this information with a few lines of code:
```python
names = soup.find_all('name')

for name in names:
    print(name.text)
```
Running python teachers.py will give us:
```
Sam Davies
Cassie Stone
Derek Brandon
```
The find_all() method returns a list of all matching tags passed to it as an argument. As shown in the code above, soup.find_all('name') returns all <name> tags in the XML file. We then iterate over these tags and print their text property, which contains the tags' values.
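Related to find_all(), the find() method returns only the first match (or None if there is no match), and a tag's attributes can be read dict-style. A quick sketch; note that the id attribute here is hypothetical, added only to illustrate attribute access:

```python
from bs4 import BeautifulSoup

# Inline sample; the id attribute is hypothetical, added only to
# illustrate attribute access.
xml = """<teachers>
    <teacher id="t1"><name>Sam Davies</name></teacher>
    <teacher id="t2"><name>Cassie Stone</name></teacher>
</teachers>"""
soup = BeautifulSoup(xml, 'xml')

first = soup.find('teacher')          # first match only; None if absent
print(first.find('name').text)        # Sam Davies
print(first['id'])                    # t1 (dict-style attribute access)
print(len(soup.find_all('teacher')))  # 2
```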
Presenting the Parsed Data in a Table
Let’s take things one step further, parse all of the contents of the XML file, and present them in tabular format.
Let’s rewrite the teachers.py file with:
```python
from bs4 import BeautifulSoup

with open('teachers.xml', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')

names = soup.find_all('name')
ages = soup.find_all('age')
subjects = soup.find_all('subject')

print('-'.center(35, '-'))
print('|' + 'Name'.center(15) + '|' + ' Age ' + '|' + 'Subject'.center(11) + '|')

for i in range(0, len(names)):
    print('-'.center(35, '-'))
    print(f'|{names[i].text.center(15)}|{ages[i].text.center(5)}|{subjects[i].text.center(11)}|')

print('-'.center(35, '-'))
```
The output of the code above will look like this:
```
-----------------------------------
|     Name      | Age |  Subject  |
-----------------------------------
|  Sam Davies   | 35  |   Maths   |
-----------------------------------
| Cassie Stone  | 24  |  Science  |
-----------------------------------
| Derek Brandon | 32  |  History  |
-----------------------------------
```
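One caveat with the approach above: three parallel find_all() lists silently fall out of alignment if any <teacher> record were missing one of its tags. A more defensive sketch iterates per <teacher> record instead (using an inline copy of the sample so it runs standalone):

```python
from bs4 import BeautifulSoup

# Inline copy of the teachers.xml sample so this snippet runs standalone.
xml = """<teachers>
    <teacher><name>Sam Davies</name><age>35</age><subject>Maths</subject></teacher>
    <teacher><name>Cassie Stone</name><age>24</age><subject>Science</subject></teacher>
</teachers>"""
soup = BeautifulSoup(xml, 'xml')

rows = []
for teacher in soup.find_all('teacher'):
    # Looking up children inside each record keeps the fields paired,
    # even if one <teacher> were missing a tag.
    rows.append({
        'name': teacher.find('name').text,
        'age': int(teacher.find('age').text),
        'subject': teacher.find('subject').text,
    })

print(rows[0])  # {'name': 'Sam Davies', 'age': 35, 'subject': 'Maths'}
```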
Congratulations! You have just parsed your first XML file with BeautifulSoup and lxml! Now that you’re more comfortable with the theory and the process, let’s try a more real-world example.
We formatted the data as a table as a precursor to storing it in a versatile data structure. That is, in the upcoming mini-project, we will store the data in a Pandas DataFrame.
If you are not yet familiar with DataFrames – read our Python with Pandas: Guide to DataFrames!
Parsing an RSS Feed and Storing the Data in a CSV
In this section, we will parse the RSS feed of The New York Times and store that data in a CSV file.
RSS is an abbreviation of Really Simple Syndication. An RSS feed is a file that contains a summary of updates from a website, written in XML. In this case, the RSS feed of The New York Times contains a summary of daily news updates on their website. This summary contains links to new releases, links to article images, descriptions of news items, and more. RSS feeds are also used to allow people to get data without scraping websites, as a nice token from publishers.
Here’s a snapshot of an RSS feed from the New York Times:
You can access various New York Times RSS feeds for continents, countries, regions, topics, and other criteria via this link.
It is important to see and understand the structure of the data before you can begin analyzing it. The data we would like to extract from the RSS feed for each news article are:
- Global Unique ID (GUID)
- Title
- Publication date
- Description
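For orientation, a single RSS item carrying such fields looks roughly like the hand-made sample below (not real New York Times data), and BeautifulSoup reads it the same way as our teacher file:

```python
from bs4 import BeautifulSoup

# Hand-made sample imitating the shape of one RSS <item>; not real NYT data.
rss = """<rss version="2.0"><channel><item>
    <guid>https://example.com/story-1</guid>
    <title>Sample headline</title>
    <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
    <description>Sample summary text.</description>
</item></channel></rss>"""

item = BeautifulSoup(rss, 'xml').find('item')
print(item.find('guid').text)   # https://example.com/story-1
print(item.find('title').text)  # Sample headline
```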
Now that we know the structure and have clear goals, let’s get started! We’ll need the requests library and the pandas library to retrieve the data and easily convert it to a CSV file.
If you haven’t worked with requests before, read our Python Requests Module Guide!
With requests, we can send HTTP requests to websites and parse the responses. In this case, we can use it to retrieve their RSS feed (in XML) so that BeautifulSoup can parse it. With pandas, we can format the parsed data as a table, and finally store the table’s contents in a CSV file.
In the same working directory, install requests and pandas (your virtual environment should still be active):
```shell
pip install requests pandas
```
In a new file, nyt_rss_feed.py, let’s import our libraries:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
Next, let’s send an HTTP request to the New York Times server to receive their RSS feed and retrieve its contents:
```python
url = 'https://rss.nytimes.com/services/xml/rss/nyt/US.xml'
xml_data = requests.get(url).content
```
With the code above, we get a response to the HTTP request and store its contents in the xml_data variable. The requests library returns this data as raw bytes, which BeautifulSoup can consume directly.
Now, create the following function to parse the XML data into a table in Pandas, using BeautifulSoup:
```python
def parse_xml(xml_data):
    # Initializing soup variable
    soup = BeautifulSoup(xml_data, 'xml')

    # Finding all <item> tags in the feed
    all_items = soup.find_all('item')
    items_length = len(all_items)

    rows = []
    for index, item in enumerate(all_items):
        row = {
            'guid': item.find('guid').text,
            'title': item.find('title').text,
            'pubDate': item.find('pubDate').text,
            'description': item.find('description').text,
        }
        rows.append(row)
        print(f'Appending row {index + 1} of {items_length}')

    # DataFrame.append() was removed in pandas 2.0, so we collect the
    # rows in a list and build the DataFrame in one go instead.
    df = pd.DataFrame(rows, columns=['guid', 'title', 'pubDate', 'description'])
    return df
```
The function above parses the XML data from the HTTP request with BeautifulSoup, storing its contents in a soup variable. It then builds a Pandas DataFrame with rows and columns for the data we want to extract.
Next, we iterate through the parsed document to find all the <item> tags. By iterating through each <item> tag, we are able to extract its children tags: <guid>, <title>, <pubDate>, and <description>. Notice how we use the find() method to retrieve only a single object per tag. We add the values of each child tag to the table, one row per item.
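One practical note: find() returns None when a tag is absent, so a chained item.find('guid').text would raise an AttributeError on a malformed item. A hedged defensive sketch, using a made-up item (not live feed data) with a hypothetical text_or_none helper:

```python
from bs4 import BeautifulSoup

# Made-up item that is deliberately missing its <description> tag.
xml = """<item><guid>g-1</guid><title>Headline</title>
         <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate></item>"""
item = BeautifulSoup(xml, 'xml').find('item')

def text_or_none(parent, tag):
    """Return the tag's text if present, otherwise None instead of raising."""
    found = parent.find(tag)
    return found.text if found is not None else None

print(text_or_none(item, 'title'))        # Headline
print(text_or_none(item, 'description'))  # None
```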
Now, at the end of the file after the function, add these two lines of code to call the function and create a CSV file:
```python
df = parse_xml(xml_data)
df.to_csv('news.csv')
```
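One detail worth knowing: by default, to_csv() also writes the DataFrame's numeric index as an unnamed first column. Passing index=False keeps the CSV to just the feed's fields. A small self-contained sketch with two dummy rows (the file name news_sample.csv is arbitrary):

```python
import pandas as pd

# Two dummy rows standing in for parsed feed items.
df = pd.DataFrame([
    {'guid': 'g-1', 'title': 'First headline'},
    {'guid': 'g-2', 'title': 'Second headline'},
])

# index=False drops the 0, 1, ... index column from the output file.
df.to_csv('news_sample.csv', index=False)

with open('news_sample.csv') as f:
    header = f.readline().strip()
print(header)  # guid,title
```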
Run python nyt_rss_feed.py to create a new CSV file in your current working directory:
```
Appending row 1 of 24
Appending row 2 of 24
...
Appending row 24 of 24
```
The resulting CSV file will contain one row per news item, with the guid, title, pubDate, and description columns we extracted.
Note: Downloading the data may take a while, depending on your internet connection and the RSS feed itself. Parsing the data also depends on your CPU and memory resources. The feed we used is quite small, so it should process quickly. Please be patient if you do not see results immediately.
Congratulations, you have successfully parsed an RSS feed from The New York Times and converted it to a CSV file!
In this tutorial, we learned how to set up BeautifulSoup and lxml to parse XML files. We first got some practice by parsing a simple XML file with teacher data, and then we parsed The New York Times’ RSS feed, converting its data to a CSV file.
You can use these techniques to parse other XML files you may encounter, and convert them to the various formats that you need!