XML (eXtensible Markup Language) is a popular format for storing and transporting data. With Python‘s built-in tools and libraries, working with XML files is straightforward. In this comprehensive guide, we‘ll explore how to parse, read, write, and modify XML using Python.
Parsing XML Files in Python
To start manipulating XML files, we first need to load and parse the content. Python‘s xml
module provides various APIs for this purpose. The commonly used ones are xml.etree.ElementTree
and xml.dom.minidom
. Let‘s look at examples of using both to parse an XML file.
Using xml.etree.ElementTree
ElementTree
provides a simple way to parse and create XML data. Here‘s how to use it:
import xml.etree.ElementTree as ETtree = ET.parse(‘data.xml‘)root = tree.getroot()
This loads the XML file, parses it and returns the root element. We can now traverse the tree to access elements and extract data.
For example, to print all item names:
for elem in root.findall(‘item‘): name = elem.get(‘name‘) print(name)
ElementTree
also makes it easy to manipulate the XML structure by adding, updating and removing elements.
Using xml.dom.minidom
xml.dom.minidom
uses the DOM (Document Object Model) approach to represent XML as a tree structure in memory for easy traversal and manipulation.
Here‘s an example to parse the XML and print some basic details:
import xml.dom.minidom as mddoc = md.parse(‘data.xml‘)print(doc.nodeName) print(doc.firstChild.tagName)
This prints the node name of root and its first child element. We can also fetch all elements of a tag name using doc.getElementsByTagName()
.
In summary, ElementTree
offers a simple API while minidom
allows closely examining and manipulating XML documents.
Reading XML Data in Python
Now let‘s discuss different ways to extract or read data from parsed XML using Python.
Accessing Elements
The getroot()
and getElementsByTagName()
methods return element objects. Each element has attributes/methods to explore its properties:
for employee in doc.getElementsByTagName(‘employee‘): id = employee.getAttribute(‘id‘) name = employee.getElementsByTagName(‘name‘)[0].childNodes[0].nodeValue role = employee.getElementsByTagName(‘role‘)[0].childNodes[0].nodeValue print(f‘ID: {id}, Name: {name}, Role: {role}‘)
This loops through employee elements, extracts the ID attribute, finds inner text of name and role tags to print employee details.
Finding Elements
We can search for elements in the parsed XML based on attributes, tag names or text values using .find()
and .findall()
:
# Get first expertise element with SQL value sql_skill = root.find(‘.//expertise[text()="SQL"]‘)# Get all expertise elements skills = root.findall(‘.//expertise‘)
Here .//
searches recursively from current node and []
has the matching condition.
XPath Queries
For more complex queries, XPath expressions can be used to find data:
xpath_query = "//employee/name/text()"name_nodes = doc.xpath(xpath_query)for node in name_nodes: print(node.nodeValue)
This prints names of all employee elements by evaluating the XPath query.
So in short, we have multiple options to search or extract any data from XML using Python.
Modifying and Writing XML
We often need to make changes by adding, updating or removing elements and attributes in XML files. Python has good support for this as well with xml.etree.ElementTree
and xml.dom.minidom
.
Let‘s look at examples of modifying the parsed XML tree and also writing it back to a file.
Updating Elements
To update some data:
for expertise in doc.getElementsByTagName(‘expertise‘): if expertise.getAttribute(‘name‘) == ‘Testing‘: expertise.setAttribute(‘name‘, ‘Software Testing‘) print(doc.toxml())
Here we change the name from "Testing" to "Software Testing" for expertise elements. The updated XML content is printed.
Adding Elements
We can also add new elements to the XML:
new_expertise = doc.createElement(‘expertise‘)new_expertise.setAttribute(‘name‘, ‘Big Data‘)doc.firstChild.appendChild(new_expertise)
This appends a new expertise node in the XML structure.
Deleting Nodes
Similarly, unwanted elements can be removed:
sql_skill = root.find(‘.//expertise[text()="SQL"]‘)root.remove(sql_skill)
This finds SQL expertise node and deletes it from the tree.
Writing XML Files
Finally, we can write the updated XML content back into a file:
with open(‘new_data.xml‘, ‘w‘) as f: f.write(doc.toprettyxml())
This properly formats and writes the XML to new_data.xml
.
So Python gives lots of flexibility to modify XML and persist changes.
Parsing Large XML Files
When dealing with large XML data, loading fully into memory with parse()
may not be feasible.
We can use streaming APIs like iterparse()
to handle such cases. It generates events while parsing and we handle each element separately.
Here‘s an example:
import xml.etree.ElementTree as ETcontext = ET.iterparse(‘large_data.xml‘, events=(‘start‘, ‘end‘)) # Iterate through elements for event, elem in context: if event == ‘start‘ and elem.tag == ‘item‘: print(f‘Reading item {elem.get("id")}‘) if event == ‘end‘: elem.clear() # discard elementcontext.close()
This minimized memory usage by clearing elements after reading. The same approach works for writing large XML files incrementally.
XML Schema Validation
We may want to validate XML against a schema XSD file to ensure integrity. The schema
package makes validation simple:
import schema xml_doc = ‘data.xml‘xsd_doc = ‘schema.xsd‘schema.validate(xml_doc, xsd_doc)
This raises ValidationError
if XML structure does not conform to the schema. We can handle them to take corrective actions.
Conclusion
In this guide, we explored various ways to parse, read, modify and write XML using Python‘s standard library modules mainly xml.etree.ElementTree
and xml.dom.minidom
.
Key points are:
- Use
ElementTree
for simple XML parsing and processing - Prefer
minidom
for advanced DOM based manipulation - Extract data via element attributes, tag names, XPath queries etc.
- Update XML tree in memory and write changes back to files
- Employ streaming for large XML data
- Validate against schema for catching errors
Python makes XML handling quite straightforward. Hope you find the techniques useful for your applications. Let me know if you have any other XML usage tricks with Python.