XML stands for Extensible Markup Language. Like HTML, it is a markup language, and it is used to store, structure, and transfer data between systems such as online servers.
XML is not meant for displaying data, because raw markup is hard for an ordinary user to read. Its syntax, however, is very similar to that of HTML.
The basic difference between the two is that XML describes and stores data, whereas HTML describes the structure of a webpage and how its content is displayed.
Python is a very flexible language, and given the popularity of XML, you are quite likely to deal with XML several times while working on projects.
Let’s look at an XML dataset and then use Python to parse and read it.
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
The data above contains information about people’s names and a set of jobs. We are not going to use a big dataset in this article, as this one is enough to understand how XML is handled in Python.
Save the file
To parse XML in Python, we will use the built-in xml package. It offers two approaches for parsing an XML file:
- xml.dom.minidom
- xml.etree.ElementTree (usually imported as ET)
Let’s understand the minidom approach.
minidom for XML Files
The minidom module builds a Document Object Model (DOM) of an XML document with its parse() function. The resulting Document object can then be queried to read the XML data.
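A minimal sketch of reading the dataset with minidom might look like the following. The file name sample.xml and the variable names are assumptions introduced here for illustration; the sample is written to a temporary file first only so that the snippet runs on its own.

```python
import os
import tempfile
from xml.dom import minidom

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "sample.xml")  # hypothetical file name
    with open(xml_path, "w", encoding="utf-8") as xml_file:
        xml_file.write(XML_DATA)
    document = minidom.parse(xml_path)  # build the DOM from the file

# Collect every <person> element, then read the first one's name attribute.
persons = document.getElementsByTagName("person")
first_name = persons[0].attributes["name"].value
print(first_name)  # Bob
```

If the XML is already held in a string, minidom.parseString() can be used in place of parse().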
Let’s understand the code:
- We create a Document object by calling the minidom.parse() function. The file is passed to parse(), which acts as the XML parser and converts the XML document into Python objects we can work with.
- To find a particular tag, for example ‘person’, use the getElementsByTagName() method.
- The data contains three tags named ‘person’, and each of them has an attribute called ‘name’. We access that attribute through the element’s attributes mapping and read its value property.
We can print all of the attributes and their values from the document as below:
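A sketch of such a loop, assuming the sample dataset is held in a string named XML_DATA (a name introduced here) and limited to the person elements for brevity:

```python
from xml.dom import minidom

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

document = minidom.parseString(XML_DATA)

rows = []
for person in document.getElementsByTagName("person"):
    name = person.attributes["name"].value  # value of the name attribute
    text = person.firstChild.data           # text inside the element
    rows.append((name, text))
    print(name, text)
```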
Let’s understand the code:
- There are two ways to print the attributes of the items in XML data: print each attribute one by one (as in the previous example), or use a for loop.
- Using the firstChild property, we access the text data of each person and print it on the screen.
Counting Elements using minidom
You can also obtain the number of matching elements in the XML document by passing the result of getElementsByTagName() to Python’s built-in len() function.
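For instance, counting the person elements could be sketched like this (XML_DATA is again a name assumed for the sample dataset):

```python
from xml.dom import minidom

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

document = minidom.parseString(XML_DATA)
persons = document.getElementsByTagName("person")

person_count = len(persons)  # the returned node list supports len()
print(person_count)  # 3
```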
ElementTree for XML Files
minidom is not the only way to work with XML files in Python. ElementTree, an easy-to-use module in the xml.etree package, is often the more convenient approach.
XML stores data in a hierarchical, tree-like structure. ElementTree represents each node as an Element object, which behaves like a hybrid between a list (of its children) and a dictionary (of its attributes).
Each element has several properties tied to it:
- Tag: A Python string that tells us what kind of data the element contains or represents.
- Attributes: Attributes can be understood as variables that belong only to that particular element. ElementTree exposes them as a Python dictionary.
- Child Elements: Also seen in the minidom module described in the previous section, these are the sub-elements of an element. They are kept in document order, and the element can be indexed and iterated like a Python sequence of its children.
To use this module, let’s import it first in our Python script.
Every XML document has exactly one root element, which contains all the other elements of the document. To access it, call the getroot() method on the tree object returned by the module's parse() function.
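A minimal sketch of importing the module, parsing the dataset, and fetching the root. Here io.StringIO stands in for a file on disk so the snippet is self-contained; ET.parse() also accepts a file path.

```python
import io
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

# ET.parse() takes a path or a file-like object; StringIO plays the file here.
tree = ET.parse(io.StringIO(XML_DATA))
root = tree.getroot()

print(root)      # the Element's repr shows its tag and memory address
print(root.tag)  # dataset
```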
In the example above, we store the root element in the ‘root’ variable, and when we print this variable, we see the root’s tag and its memory address.
The ET.fromstring() function parses XML from a string into an element. This element is the root of the parsed XML document. Let us see how to get the tag of the root from an XML string using this function.
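A short sketch, assuming an empty document whose root element is named data:

```python
import xml.etree.ElementTree as ET

# An empty dataset, in string form, whose root element is <data>.
xml_string = "<data></data>"

root = ET.fromstring(xml_string)  # returns the root Element directly

print(root.tag)     # data
print(root.attrib)  # {}
```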
Let’s understand the code:
- We create an empty dataset with the root element set as ‘data’. This data is in string form.
- The fromstring() function is used to get the root of the data that we created in string form. Please note that fromstring() expects the XML text itself (a str or bytes object), not a file name or file object.
- A root element has two components, its tag and its attributes. For now, our root element does not contain any attributes. We access them with dot notation, as root.tag and root.attrib.
Accessing Attributes and Sub-Elements
The root element in the ElementTree is used to access the sub-elements of the XML dataset and their attributes.
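One way to walk the sub-elements and their attributes is sketched below; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)

names = []
for group in root:        # <persons>, then <jobs>
    for item in group:    # each <person> or <job>
        names.append(item.attrib["name"])
        print(item.tag, item.attrib["name"], item.text)

# Elements can also be indexed like nested lists.
second_person = root[0][1]
print(second_person.attrib["name"])  # Ross
```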
This prints the persons’ names and the jobs. The tree here is nested three levels deep (the dataset, then persons and jobs, then the individual person and job elements), so list-style indexing can be chained on the root element too, for example root[0][1].
The number of elements under the root or a sub-element can easily be obtained using the len() function similar to the minidom implementation.
Consider this for an example:
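A small sketch of counting children with len(), using the sample dataset held in a string:

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)

top_level_count = len(root)   # direct children of <dataset>
person_count = len(root[0])   # direct children of <persons>

print(top_level_count)  # 2
print(person_count)     # 3
```

Note that len() counts only direct children, not all descendants.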
Combining Python with XML in this way lets you query and process the contents of an XML file programmatically.
Writing XML using Python
The ElementTree module can also be used for writing data to XML files, for example after modifying a node’s tag, its text, and the values of its attributes.
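A sketch of such a modification follows. The new text and tag values are hypothetical, and the output goes to a temporary directory only so the snippet can run anywhere; writing to a plain file name would place the file next to your script instead.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

tree = ET.ElementTree(ET.fromstring(XML_DATA))
root = tree.getroot()

# Change text, attribute value, and tag of every element under <persons>.
for person in root.find("persons"):
    person.text = "newPerson"   # hypothetical replacement text
    person.set("name", "Bill")  # overwrite the name attribute
    person.tag = "worker"       # hypothetical new tag

out_path = os.path.join(tempfile.mkdtemp(), "newPerson.xml")
tree.write(out_path)  # the in-memory original string is left untouched
```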
The result is a new XML file named ‘newPerson.xml’, created in the same directory as your script, in which every person’s name is replaced by ‘Bill’, who is the master of all trades.
Let’s understand the code:
- To access all the elements in our XML file, we get the root of the XML file.
- Then, one by one, we change the properties of the sub-elements of the root, starting with the node text and continuing with the name attribute and the tag of the new person.
- tree.write() saves the modified tree to a new file, leaving the original XML file unchanged.
ElementTree provides several ways to create new elements in an XML file. The first is the makeelement() method, which takes two parameters:
- Node name: The name, or tag, of the new sub-element.
- A dictionary: The items in the dictionary act as the attributes of the sub-element.
Suppose you want to create a new node as a sub-element of the root. This node does not require attributes, so you leave the dictionary empty.
You pass the name of the new node and the empty attribute dictionary to makeelement(), then attach the returned element to the root with append().
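As a sketch, with cities as a hypothetical name for the new node:

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)

# makeelement(tag, attrib) only creates the element; it is not attached yet.
new_node = root.makeelement("cities", {})  # "cities" is a hypothetical tag
root.append(new_node)                      # attach it under the root

child_tags = [child.tag for child in root]
print(child_tags)  # ['persons', 'jobs', 'cities']
```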
To add elements to a node like this, you create a sub-node under it, together with a dictionary of the attributes it needs. This brings us to the next way of creating sub-elements: the SubElement() function.
SubElement() creates the new node and attaches it to its parent in one step. Save the XML file afterwards and that is it: you have added an entire node to your XML data with Python!
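A sketch using SubElement(), where the cities and city tags and their values are made up for illustration:

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)

# SubElement(parent, tag, attrib) creates the node and appends it to parent.
cities = ET.SubElement(root, "cities")                   # hypothetical node
city = ET.SubElement(cities, "city", {"name": "Paris"})  # hypothetical data
city.text = "city1"

new_branch = ET.tostring(cities, encoding="unicode")
print(new_branch)  # <cities><city name="Paris">city1</city></cities>
```

To persist the change, wrap the root in ET.ElementTree(root) and call its write() method with a file name.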
The SubElement() factory function is recommended over the makeelement() method for creating new nodes.
You have seen how to create and modify elements in an XML file using Python. Now, let’s understand how to delete elements from the XML file.
This can also be done with the same ElementTree module, using an element’s remove() method.
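A sketch of removing one person by name, here the person called Ross from the sample dataset:

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)
persons = root.find("persons")

# findall() returns a list, so removing while looping over it is safe.
for person in persons.findall("person"):
    if person.get("name") == "Ross":
        persons.remove(person)  # remove() is called on the parent element

remaining = [person.get("name") for person in persons]
print(remaining)  # ['Bob', 'Tom']
```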
As you can see, all the details about the person named “Ross” have been removed.
You can also remove all of the sub-elements under a node with the help of the clear() function.
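A sketch of clear() applied to the jobs node of the sample dataset:

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)
jobs = root.find("jobs")

jobs.clear()  # drops all sub-elements; also resets attributes and text

jobs_left = len(jobs)
print(jobs_left)  # 0
```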
Here, clear() removes all the sub-elements under the “jobs” branch. Note that it also resets the element’s own attributes and text. But what if you only want to remove a single attribute from your XML data? That can also be done with this module.
Let’s see how.
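One possible way is to pop the key from the element's attrib dictionary, sketched here for the first person element:

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
  <jobs>
    <job name="Electrician">job1</job>
    <job name="Painter">job2</job>
    <job name="Programmer">job3</job>
  </jobs>
</dataset>
"""

root = ET.fromstring(XML_DATA)
first_person = root.find("persons")[0]

# attrib is a plain dict, so pop() removes a single attribute.
first_person.attrib.pop("name", None)

first_person_xml = ET.tostring(first_person, encoding="unicode").strip()
print(first_person_xml)  # <person>person1</person>
```

Saving the tree with write() afterwards would produce a file in which that attribute is gone.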
Now, if you open the newly created file, you’ll see that the name attribute of the first person element under the persons node no longer exists.
Conversion of XML to other data types
Python has many useful, publicly available modules that let you do countless things with the language. One of them is the xmltodict module, which makes the tedious task of converting XML files to JSON or a Python dictionary very easy.
You first have to install this module, as it does not come pre-installed with Python.
You can install xmltodict with pip, Python’s package installer, which downloads it from the Python Package Index (PyPI):
pip install xmltodict
It’s a very lightweight module, so installation does not take long. If the module is installed successfully, the command finishes without reporting any errors.
Python XML to Dict
There are many similarities between a dictionary and an XML data structure. Like XML, a nested dictionary can represent a tree structure, so XML elements can be converted fairly directly into dictionary keys and values.
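To illustrate the tree-to-dictionary idea using only the standard library, here is a rough sketch of a recursive converter. It is not how xmltodict works internally, and the key names (tag, attributes, text, children) are choices made for this example, so the output layout differs from what xmltodict produces.

```python
import xml.etree.ElementTree as ET

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
</dataset>
"""


def element_to_dict(element):
    """Recursively turn an Element into a nested dictionary."""
    node = {"tag": element.tag, "attributes": dict(element.attrib)}
    text = (element.text or "").strip()  # ignore whitespace-only text
    if text:
        node["text"] = text
    node["children"] = [element_to_dict(child) for child in element]
    return node


root = ET.fromstring(XML_DATA)
as_dict = element_to_dict(root)

second_person = as_dict["children"][0]["children"][1]
print(second_person["attributes"]["name"], second_person["text"])  # Ross person2
```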
Python XML to JSON
In the section above, we wrote our XML data in the code itself, but that is not realistic and will not always be possible. In many cases, you will have to read the XML from a file and parse it in your Python code before converting it to another data structure. Let’s see how it can be done:
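A sketch of the file-based conversion with xmltodict, json, and pprint. The file name and the helper xml_file_to_json() are assumptions introduced here, and the import is guarded only so the snippet does not crash where xmltodict has not been installed yet.

```python
import json
import os
import tempfile
from pprint import pprint

try:
    import xmltodict  # third-party; install with: pip install xmltodict
except ImportError:
    xmltodict = None

XML_DATA = """\
<dataset>
  <persons>
    <person name="Bob">person1</person>
    <person name="Ross">person2</person>
    <person name="Tom">person3</person>
  </persons>
</dataset>
"""


def xml_file_to_json(path):
    """Hypothetical helper: read an XML file and return it as a JSON string."""
    with open(path, encoding="utf-8") as xml_file:
        parsed = xmltodict.parse(xml_file.read())  # XML text -> dictionary
    pprint(parsed)                                 # readable view of the dict
    return json.dumps(parsed, indent=2)            # dictionary -> JSON text


json_text = None
if xmltodict is not None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        xml_path = os.path.join(tmp_dir, "sample.xml")  # hypothetical name
        with open(xml_path, "w", encoding="utf-8") as sample_file:
            sample_file.write(XML_DATA)
        json_text = xml_file_to_json(xml_path)
    print(json_text)
else:
    print("xmltodict is not installed")
```

In xmltodict's output, attributes appear under keys prefixed with @ and element text under #text, so the first person's name ends up at ["dataset"]["persons"]["person"][0]["@name"].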
Let’s understand the code:
- Python has a pretty-print module named pprint that displays complex data structures in a readable manner. Here we use it to print the data parsed from the XML.
- Next, we open the file and read its contents with the read() method. The xmltodict.parse() function then parses that XML text into a dictionary.
- With the data parsed into a dictionary, we can convert it to a JSON string using the json.dumps() function from Python’s json module.