3. Loading, Converting, and Writing JSON Files#
3.1. Introduction: Douglas Crockford’s JSON Saga#
JSON stands for JavaScript Object Notation, and it is one of the most frequently used data formats for transferring data over the Internet. If you will be using a web-based data transfer system, such as an API (more on that in the next module), you will be dealing with data in JSON format.
The JSON data format is attributed to Douglas Crockford, a Javascript architect at Yahoo!, although in his words: “I discovered JSON. I do not claim to have invented JSON, because it already existed in nature. What I did was I found it, I named it, I described how it was useful.” Here’s an excellent talk by Douglas Crockford in which he describes the origin of JSON and the early struggles to create a universal and lightweight language for data transference. The talk is about 50 minutes long, but it is worth watching if you have the time.
from IPython.display import IFrame
IFrame(src="https://www.youtube.com/embed/-C-JoyNuQJs", width="560", height="315")
For what it’s worth, Douglas Crockford calls the language “Jay-Sin”, like the first name, and not “Jay-Sawn”. Although he says “I strictly don’t care” about however people choose to pronounce JSON.
Crockford and his collaborators developed JSON in 2001 with the objective of creating a language to store, display, and transfer in a way that works on every browser. In other words, the whole point of JSON is to be universal, so that the data are readable no matter the browser or environment. JSON is also designed to be as lightweight as possible: that is, JSON code is as compact as possible. JSON sacrificies features that expand the functionality of the language, such as comments, processing instructions, or attributes, in order to use the minimal amount of code. The lightweight construction of JSON makes JSON code much faster than alternative data interchange languages like XML. It also helps achieve the goal of universality: a lightweight language works within existing infrastructures with no additional software or extraneous code because every programming language understands data in some form, and a minimalist data structure works with every language’s idea of what data should be. In Crockford’s words:
One of the key design goals behind JSON was minimalism. My idea was that the less we have to agree on in order to inter-operate, the more likely we’re going to be able to inter-operate well. If the interfaces are really simple, we can easily connect, and if the interfaces are really complicated, the likelihood that something’s going to go wrong goes way, way up. So I endeavored to make JSON as simple as possible.
Before we get to examples of using JSONs in Python, load the following libraries:
import numpy as np
import pandas as pd
import json
import requests
import sys
sys.tracebacklimit = 0 # turn off the long tracebacks on error messages
#sys.tracebacklimit = None # use to turn tracebacks back on if needed
3.2. The Structure of a JSON File#
The biggest difference between JSON (and other data interchange formats like XML, PHP, and YAML) and CSV is that JSON accomodates tree-based data structures, whereas CSV only handles tabular data. Many real-world applications require a tree-based structure for data. For example, here are the first two records from fake data from a business’s customer records:
users = requests.get("https://jsonplaceholder.typicode.com/users")
print(users.text[0:1110])
[
  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "email": "Sincere@april.biz",
    "address": {
      "street": "Kulas Light",
      "suite": "Apt. 556",
      "city": "Gwenborough",
      "zipcode": "92998-3874",
      "geo": {
        "lat": "-37.3159",
        "lng": "81.1496"
      }
    },
    "phone": "1-770-736-8031 x56442",
    "website": "hildegard.org",
    "company": {
      "name": "Romaguera-Crona",
      "catchPhrase": "Multi-layered client-server neural-net",
      "bs": "harness real-time e-markets"
    }
  },
  {
    "id": 2,
    "name": "Ervin Howell",
    "username": "Antonette",
    "email": "Shanna@melissa.tv",
    "address": {
      "street": "Victor Plains",
      "suite": "Suite 879",
      "city": "Wisokyburgh",
      "zipcode": "90566-7771",
      "geo": {
        "lat": "-43.9509",
        "lng": "-34.4618"
      }
    },
    "phone": "010-692-6593 x09125",
    "website": "anastasia.net",
    "company": {
      "name": "Deckow-Crist",
      "catchPhrase": "Proactive didactic contingency",
      "bs": "synergize scalable supply-chains"
    }
There are several elements of the JSON format that we need to discuss.
3.2.1. Lists, Sets, and Dictionaries#
First, notice the opening square brace [. This character tells Python to read all the following records as elements of a list. In a Python list, the order of the elements matters and elements can be repeated, but are not given names. The first element is denoted with index 0, the second element is denoted with index 1, and so on. For example:
my_list = [5,8,-9]
my_list[0]
5
Within the JSON syntax, each individual element of the list is a set, denoted by curly braces { and }. A Python set differs from a Python list in that the order of the elements does not matter (it sorts the elements automatically), it doesn’t allow repetition, and it allows the elements to be named. For example, in the following code, notice how the repeated 5s are removed and how the elements are sorted from smallest to largest:
my_set = {5,5,5,8,-9}
my_set
{-9, 5, 8}
It’s only possible to use call individual elements of a set if those elements are given distinct names:
my_set = {'larry':5, 'curly':8, 'moe':-9}
my_set['larry']
5
In Python, sets in which elements of a set have names are called dictionaries. The names are called keys, and the elements themselves are called values. So in this case, an element of the dictionary has a key of “larry” and a value of 5. Like sets, dictionaries enforce no repetition, but they apply this restriction to the keys only: multiple keys can have the same value, but every value must have a distinct key. In this example, I define the key “larry” twice, and only the last “larry” entered into the dictionary definition is saved:
my_set = {'larry':15, 'larry':10, 'curly':10, 'moe':-9}
my_set
{'larry': 10, 'curly': 10, 'moe': -9}
That means that JSON notation is the same as a list in which each element is a dictionary in Python. Every dictionary represents one record in the data, and when we convert the data to a tabular dataframe, each record will occupy one row in the data.
3.2.2. Nested Structures#
JSON can store data structures that are awkward or impossible for CSV. We can convert JSON formats to tabular formats to store the data in a data frame, but in doing so, we lose information about which features are nested within other fields.
Let’s return to the example of the customer records stored in JSON format. In these data, some features are themselves dictionaries that contain additional features, resulting in a nested and tree-like shape to the data. The first record looks like:
print(users.text[0:560])
[
  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "email": "Sincere@april.biz",
    "address": {
      "street": "Kulas Light",
      "suite": "Apt. 556",
      "city": "Gwenborough",
      "zipcode": "92998-3874",
      "geo": {
        "lat": "-37.3159",
        "lng": "81.1496"
      }
    },
    "phone": "1-770-736-8031 x56442",
    "website": "hildegard.org",
    "company": {
      "name": "Romaguera-Crona",
      "catchPhrase": "Multi-layered client-server neural-net",
      "bs": "harness real-time e-markets"
    }
  },
 
For this customer, we have the ID number, name, username, and email address. So far we can easily place these values in the same row of a table to store the same information. But the next key, address, is set equal to a dictionary, as indicated by the additional set of curly braces. Within the address field, we have data on the customer’s street, suite, city, and ZIP code. We also have a field, geo, equal to yet another dictionary that contains the latitude and longitude coordinates of the address. The structure then returns to the first level of nesting, providing the phone number and website, before introducing another branch for company with three sub-fields: name, catch phrase, and business slogan. In all, the structure of this JSON data is best described with the following tree:

This tree can only be placed into a table, as we must do to work with a data frame, if we lose the information about nesting and instead treat the 15 distinct features as 15 columns in the data frame.
3.2.3. Metadata#
In addition to nesting, JSON is an ideal format for placing metadata in the same file as the main data. Metadata is “data that provides information about other data”. Metadata might describe the date the data were last accessed, provide the stable URL for accessing the data, might credit authorship or describe copyright, and so on, prior to displaying information about each of the records in the data. In general, JSON formats with metadata can look like this:
 
In order to convert JSON data with a metadata format to a tabular data frame, you will have to specify the name of the branch that contains the records. As with nested structures, this procedure involves losing information. In this case, we will lose the metadata.
3.2.4. Missing Values and Different Data Types#
After unpacking the nested structure of a JSON data file and removing metadata, there can be differences from record to record in the information stored in the file. It’s possible that some records do not have data on the same set of features that other records do. Even if the same feature is present across records, it is possible that its value has one data type in some records and another data type in other records.
For example, consider public opinion data in which records are individuals answering questions on a survey and the features include an individual ID number, the individual’s rating of the Republican Candidate, the individual’s rating of the Democratic candidate, and the individual’s birthyear and gender. One possible way a few records can look like in JSON format (and then converted to a data frame) is:
case1 = '{"caseid":1.0,"ftrep":"Awful","ftdem":"Pretty good","birthyr":1960}'
case2 = '{"caseid":2.0,"ftrep":28.0,"ftdem":52.0,"birthyr":1987,"gender":2}'
case3 = '{"caseid":3.0,"ftrep":100.0,"ftdem":1.0,"gender":1}'
case_json = '[' + case1 + ',\n' + case2 + ',\n' + case3 + ']'
print(case_json)
[{"caseid":1.0,"ftrep":"Awful","ftdem":"Pretty good","birthyr":1960},
{"caseid":2.0,"ftrep":28.0,"ftdem":52.0,"birthyr":1987,"gender":2},
{"caseid":3.0,"ftrep":100.0,"ftdem":1.0,"gender":1}]
pd.read_json(case_json)
/var/folders/4w/k9sqqcbx4dxgpjtwgv_1m29h0000gq/T/ipykernel_14273/1653054356.py:1: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd.read_json(case_json)
| caseid | ftrep | ftdem | birthyr | gender | |
|---|---|---|---|---|---|
| 0 | 1 | Awful | Pretty good | 1960.0 | NaN | 
| 1 | 2 | 28.0 | 52.0 | 1987.0 | 2.0 | 
| 2 | 3 | 100.0 | 1.0 | NaN | 1.0 | 
Take a moment to look at the raw JSON formatted data. There are some differences between the three cases. The first case did not record the individual’s gender and the third case did not record the individual’s birthyear. Note that in the JSON format, these features are simply absent: we do not need to explicitly code these values as missing. When we convert the data to tabular format, the cells for the missing features are filled with NaN missing values automatically.
Additionally, the first case records the feeling thermometer scores for the Republican and Democrat with strings: this person feels “Awful” towards the Republican and “Pretty good” towards the Democrat. For the second and third cases these ratings are coded numerically. The JSON format naturally allows the same feature to be populated by data of a different type across records. Tabular data, however, has stricter requirements for data uniformity within a column: specifically all of the data must have the same type in a column. The data frame handles that by coding all of these values with the ambiguous object type:
pd.read_json(case_json).dtypes
/var/folders/4w/k9sqqcbx4dxgpjtwgv_1m29h0000gq/T/ipykernel_14273/2755144729.py:1: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd.read_json(case_json).dtypes
caseid       int64
ftrep       object
ftdem       object
birthyr    float64
gender     float64
dtype: object
In short, while it is straightforward to convert JSON data to tabular format, we can potentially lose a great deal of information in this conversion: nesting, metadata, and varying data type. In addition, tabular data inefficiently require that cells for missing values be filled with a symbol to denote missingness, to preserve the rectangular shape of the data frame.
3.3. Loading and Reading JSON Data in Python#
JSON, like CSV and ASCII, is a text-based system for data storage. When first loading JSON data into Python, the data will be read as a giant block of text. First we need to get Python to understand that the text is actually organized JSON data using the json library. Once we can work with the data as JSON, we can search through the JSON data’s index path to extract particular datapoints. We can even construct loops to pull out lists of values from across the records.
3.3.1. Using the requests.get(), json.loads() and json.dumps() Functions#
There is an important function from the requests library and two important functions in the json library for us to know when working with JSON data:
- requests.get()downloads data from a URL and stores it in Python’s memory as a giant block of text. It also can include additional parameters to send to the website to provide information on credentials or on specifying the subset of data to collect.
- json.loads()converts text into an object that Python recognizes as JSON-formatted data.
- json.dumps()converts JSON data into a block of text.
Consider again the JSON data on fake consumer records. The data are stored on a website called JSON Placeholder, which has several excellent example JSON datasets and has resources for guiding people who are writing their own APIs. The customer data exists here: https://jsonplaceholder.typicode.com/users. Take a moment and click on this link to see how the data appear in your web browser.
To download the data, use requests.get():
users = requests.get("https://jsonplaceholder.typicode.com/users")
To access the text of this download, use the .text attribute. If I want to display only the first record, I have to specify the first and last character numbers of this record, which turns out to be (after a lot of guess-and-checking) the first 560 characters:
print(users.text[0:560])
[
  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "email": "Sincere@april.biz",
    "address": {
      "street": "Kulas Light",
      "suite": "Apt. 556",
      "city": "Gwenborough",
      "zipcode": "92998-3874",
      "geo": {
        "lat": "-37.3159",
        "lng": "81.1496"
      }
    },
    "phone": "1-770-736-8031 x56442",
    "website": "hildegard.org",
    "company": {
      "name": "Romaguera-Crona",
      "catchPhrase": "Multi-layered client-server neural-net",
      "bs": "harness real-time e-markets"
    }
  },
 
Presently, although the text above looks like it is formatted like a JSON file, Python only understands it as text (with type str):
type(users.text)
str
To get Python to read this text as JSON data, make sure the json library is imported (which we did at the top of this notebook), and use the json.loads() function. Now, to display the first record, I only have to pass the index 0, representing the first item in a list:
users_json = json.loads(users.text)
users_json[0]
{'id': 1,
 'name': 'Leanne Graham',
 'username': 'Bret',
 'email': 'Sincere@april.biz',
 'address': {'street': 'Kulas Light',
  'suite': 'Apt. 556',
  'city': 'Gwenborough',
  'zipcode': '92998-3874',
  'geo': {'lat': '-37.3159', 'lng': '81.1496'}},
 'phone': '1-770-736-8031 x56442',
 'website': 'hildegard.org',
 'company': {'name': 'Romaguera-Crona',
  'catchPhrase': 'Multi-layered client-server neural-net',
  'bs': 'harness real-time e-markets'}}
Instead of text, Python now recognizes users_json as a list:
type(users_json)
list
3.3.2. Searching Along the JSON Index Path#
After using json.loads(), Python understands the JSON data to be a list-of-lists. We can now use indices to extract particular datapoints. The records are numbers, starting with 0, and keys within particular records can be called by name.
For example, to extract the email address of the fifth customer (element 4), in the data, I type:
users_json[4]['email']
'Lucio_Hettinger@annie.ca'
If we call fields that contain several nested features, Python returns a dictionary:
users_json[0]['address']
{'street': 'Kulas Light',
 'suite': 'Apt. 556',
 'city': 'Gwenborough',
 'zipcode': '92998-3874',
 'geo': {'lat': '-37.3159', 'lng': '81.1496'}}
To extract elements that are several levels of nesting down the JSON tree, specify the keys in order that lead to the desired element. For example, to extract the latitudinal coordinate, we navigate to the address, geo, and lat keys:
users_json[0]['address']['geo']['lat']
'-37.3159'
Finally, it may be necessary at times to covert the JSON back to text, maybe to assist in transfering the data to another platform (although I recommend using the .to_json() method, discussed below, for this purpose). To convert JSON back to text, use the json.dumps() function:
users_text = json.dumps(users_json)
type(users_text)
str
3.3.3. Looping Across Records to Extract Datapoints#
While JSON is a well-organized and flexible data format that preserves a lot of the meaningful context of the data, most analytical tools and statistical models can only operate on tabular data. So a major challenge with JSON data is converting the data to a tabular format to be saved in a data frame. One method is to construct a loop across records to extract desired elements.
Before we discuss this method, I do not recommend using this approach in general. Loops are usually slow, and it is much faster to perform these kinds of operations by vectorizing code. Vectorizing refers to the ability of a programming language to operate on an entire vector of data and work on several elements simultaneously, rather than on each element one at a time. Vectorization is the specialty of the numpy library, and most functions in pandas are vectorized as well. I suggest using pd.read_json() and pd.json_normalize() to convert JSON data to a data frame as they will be several orders of magnitude faster for converting the entire JSON to a data frame. Looping make sense only when we need to extract a small number of features from JSON data with a large feature set, because with the loop we directly call the features we want and we can avoid having to work with features we don’t need.
To start the loop, choose an index to represent one record. To display all of the email addresses, type:
for u in users_json:
    print(u['email'])
Sincere@april.biz
Shanna@melissa.tv
Nathan@yesenia.net
Julianne.OConner@kory.org
Lucio_Hettinger@annie.ca
Karley_Dach@jasper.info
Telly.Hoeger@billy.biz
Sherwood@rosamond.me
Chaim_McDermott@dana.io
Rey.Padberg@karina.biz
Alternatively, to place all of the email addresses in a list, you can loop inside of a list like this:
emails = [u['email'] for u in users_json]
emails
['Sincere@april.biz',
 'Shanna@melissa.tv',
 'Nathan@yesenia.net',
 'Julianne.OConner@kory.org',
 'Lucio_Hettinger@annie.ca',
 'Karley_Dach@jasper.info',
 'Telly.Hoeger@billy.biz',
 'Sherwood@rosamond.me',
 'Chaim_McDermott@dana.io',
 'Rey.Padberg@karina.biz']
The trick is extracting the data and saving it in a data frame at the same time. We can do this by using the pd.DataFrame() function, and using list loops inside this function:
users_df = pd.DataFrame(
    [u['name'], u['email'], u['company']['name']] for u in users_json
)
users_df.columns = ['name', 'email', 'company_name']
users_df
| name | company_name | ||
|---|---|---|---|
| 0 | Leanne Graham | Sincere@april.biz | Romaguera-Crona | 
| 1 | Ervin Howell | Shanna@melissa.tv | Deckow-Crist | 
| 2 | Clementine Bauch | Nathan@yesenia.net | Romaguera-Jacobson | 
| 3 | Patricia Lebsack | Julianne.OConner@kory.org | Robel-Corkery | 
| 4 | Chelsey Dietrich | Lucio_Hettinger@annie.ca | Keebler LLC | 
| 5 | Mrs. Dennis Schulist | Karley_Dach@jasper.info | Considine-Lockman | 
| 6 | Kurtis Weissnat | Telly.Hoeger@billy.biz | Johns Group | 
| 7 | Nicholas Runolfsdottir V | Sherwood@rosamond.me | Abernathy Group | 
| 8 | Glenna Reichert | Chaim_McDermott@dana.io | Yost and Sons | 
| 9 | Clementina DuBuque | Rey.Padberg@karina.biz | Hoeger LLC | 
Please, note, there are many many ways to construct a loop to extract JSON elements into a dataframe. If you look on Stack Overflow, for example, you will see many different approaches, and it can be confusing. Find an approach that you understand and feel comfortable using, and go with that. There’s not much difference between one loop and the next, as the real improvement comes from vectorization.
3.4. Using pd.read_json() and pd.json_normalize() to Store JSON Data in a Data Frame#
If there aren’t too many features in the data, or if you want to keep all of the features in the data anyway, then the fastest and best way to convert JSON data to a tabular data frame depends on whether or not the JSON data has metadata or a nested structure.
3.4.1. Situation 1: No nesting, no metadata#
If the JSON has no nesting - that is, if every key is associated with a single datapoint, and no key is associated with a dictionary that contains additional features - and also does not include metadata, then use the following steps:
- Use - requests.get()to download the raw JSON data (unless you have another way of acquiring the raw data)
- Use - pd.read_json()on the- .textattribute of the output of- requests.get()
The result will be a dataframe in which the columns have the same names as the keys in the JSON file. For example, JSON Placeholder has an example JSON dataset that contains random Latin posts to a blog. This JSON contains no metadata and no nesting. The best way to acquire this dataset and convert it to a dataframe is as follows:
posts = requests.get("https://jsonplaceholder.typicode.com/posts")
posts_df = pd.read_json(posts.text)
posts_df
/var/folders/4w/k9sqqcbx4dxgpjtwgv_1m29h0000gq/T/ipykernel_14273/1312368004.py:2: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  posts_df = pd.read_json(posts.text)
| userId | id | title | body | |
|---|---|---|---|---|
| 0 | 1 | 1 | sunt aut facere repellat provident occaecati e... | quia et suscipit\nsuscipit recusandae consequu... | 
| 1 | 1 | 2 | qui est esse | est rerum tempore vitae\nsequi sint nihil repr... | 
| 2 | 1 | 3 | ea molestias quasi exercitationem repellat qui... | et iusto sed quo iure\nvoluptatem occaecati om... | 
| 3 | 1 | 4 | eum et est occaecati | ullam et saepe reiciendis voluptatem adipisci\... | 
| 4 | 1 | 5 | nesciunt quas odio | repudiandae veniam quaerat sunt sed\nalias aut... | 
| ... | ... | ... | ... | ... | 
| 95 | 10 | 96 | quaerat velit veniam amet cupiditate aut numqu... | in non odio excepturi sint eum\nlabore volupta... | 
| 96 | 10 | 97 | quas fugiat ut perspiciatis vero provident | eum non blanditiis soluta porro quibusdam volu... | 
| 97 | 10 | 98 | laboriosam dolor voluptates | doloremque ex facilis sit sint culpa\nsoluta a... | 
| 98 | 10 | 99 | temporibus sit alias delectus eligendi possimu... | quo deleniti praesentium dicta non quod\naut e... | 
| 99 | 10 | 100 | at nam consequatur ea labore ea harum | cupiditate quo est a modi nesciunt soluta\nips... | 
100 rows × 4 columns
3.4.2. Situation 2: Nesting, but no metadata#
If the JSON file contains nesting, but no metadata, then the best strategy is to
- Use - requests.get()to download the raw JSON data (unless you have another way of acquiring the raw data)
- Use - json.loads()on the- .textattribute of the output from step 1 to register the data as a list in Python
- Use the - pd.json_normalize()function on the list that is the output of step 2
The pd.json_normalize() function stores every feature in the data in a separate column, no matter how many levels of nesting it must parse to find the feature.
Every column has the same name as the key from which it drew the feature. For features that are nested within other features, pd.json_normalize() uses every key on the path to the datapoint to construct the column name, separated by periods. For example, the users data that we worked with above contains up to three levels of nesting. So the lat data is stored in a column named address.geo.lat, since we had to navigate to “address”, then “geo”, then “lat” in the JSON to find these data:
users = requests.get("https://jsonplaceholder.typicode.com/users")
users_json = json.loads(users.text)
users_df = pd.json_normalize(users_json)
users_df
| id | name | username | phone | website | address.street | address.suite | address.city | address.zipcode | address.geo.lat | address.geo.lng | company.name | company.catchPhrase | company.bs | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Leanne Graham | Bret | Sincere@april.biz | 1-770-736-8031 x56442 | hildegard.org | Kulas Light | Apt. 556 | Gwenborough | 92998-3874 | -37.3159 | 81.1496 | Romaguera-Crona | Multi-layered client-server neural-net | harness real-time e-markets | 
| 1 | 2 | Ervin Howell | Antonette | Shanna@melissa.tv | 010-692-6593 x09125 | anastasia.net | Victor Plains | Suite 879 | Wisokyburgh | 90566-7771 | -43.9509 | -34.4618 | Deckow-Crist | Proactive didactic contingency | synergize scalable supply-chains | 
| 2 | 3 | Clementine Bauch | Samantha | Nathan@yesenia.net | 1-463-123-4447 | ramiro.info | Douglas Extension | Suite 847 | McKenziehaven | 59590-4157 | -68.6102 | -47.0653 | Romaguera-Jacobson | Face to face bifurcated interface | e-enable strategic applications | 
| 3 | 4 | Patricia Lebsack | Karianne | Julianne.OConner@kory.org | 493-170-9623 x156 | kale.biz | Hoeger Mall | Apt. 692 | South Elvis | 53919-4257 | 29.4572 | -164.2990 | Robel-Corkery | Multi-tiered zero tolerance productivity | transition cutting-edge web services | 
| 4 | 5 | Chelsey Dietrich | Kamren | Lucio_Hettinger@annie.ca | (254)954-1289 | demarco.info | Skiles Walks | Suite 351 | Roscoeview | 33263 | -31.8129 | 62.5342 | Keebler LLC | User-centric fault-tolerant solution | revolutionize end-to-end systems | 
| 5 | 6 | Mrs. Dennis Schulist | Leopoldo_Corkery | Karley_Dach@jasper.info | 1-477-935-8478 x6430 | ola.org | Norberto Crossing | Apt. 950 | South Christy | 23505-1337 | -71.4197 | 71.7478 | Considine-Lockman | Synchronised bottom-line interface | e-enable innovative applications | 
| 6 | 7 | Kurtis Weissnat | Elwyn.Skiles | Telly.Hoeger@billy.biz | 210.067.6132 | elvis.io | Rex Trail | Suite 280 | Howemouth | 58804-1099 | 24.8918 | 21.8984 | Johns Group | Configurable multimedia task-force | generate enterprise e-tailers | 
| 7 | 8 | Nicholas Runolfsdottir V | Maxime_Nienow | Sherwood@rosamond.me | 586.493.6943 x140 | jacynthe.com | Ellsworth Summit | Suite 729 | Aliyaview | 45169 | -14.3990 | -120.7677 | Abernathy Group | Implemented secondary concept | e-enable extensible e-tailers | 
| 8 | 9 | Glenna Reichert | Delphine | Chaim_McDermott@dana.io | (775)976-6794 x41206 | conrad.com | Dayna Park | Suite 449 | Bartholomebury | 76495-3109 | 24.6463 | -168.8889 | Yost and Sons | Switchable contextually-based project | aggregate real-time technologies | 
| 9 | 10 | Clementina DuBuque | Moriah.Stanton | Rey.Padberg@karina.biz | 024-648-3804 | ambrose.net | Kattie Turnpike | Suite 198 | Lebsackbury | 31428-2261 | -38.2386 | 57.2232 | Hoeger LLC | Centralized empowering task-force | target end-to-end models | 
3.4.3. Situation 3: Metadata#
Finally, if the JSON file contains metadata (regardless of whether or not the records contain nested data), follow these steps:
- Use - requests.get()to download the raw JSON data (unless you have another way of acquiring the raw data)
- Use - json.loads()on the- .textattribute of the output from step 1 to register the data as a list in Python
- Look at the data using a web-browser, or using the text that appears when the JSON object is called in Python, to decide on the path that is necessary to find the records 
- Use the - pd.json_normalize()function on the list that is the output of step 2. But within this function, use the- record_pathparameter and set it equal to a list with the keys, in order, that lead to the record
- Optionally, use the - metaand- meta_prefixparameters in- pd.json_normalize()to store the metadata in the dataframe
For example, we can access a JSON file for the top 25 top posts at the moment on Reddit’s r/popular page here: http://www.reddit.com/r/popular/top.json. If our goal is to construct a data frame with 25 rows, one for each post, we must find the path that leads to these data. Looking at the web-browser, the top-level has two keys: “kind” and “data”. “kind” is simply metadata telling us that this file contains listings, and the data live in “data”. Within this branch, there are four more metadata branches, “modhash”, “dist”, “before”, and “after”, and the data we need exist within “children”. So the path we need is ["data", "children"]. The code to construct the data frame we need is:
url = "http://www.reddit.com/r/popular/top.json"
reddit = requests.get(url, headers = {'User-agent': 'DS6001'})
reddit_json = json.loads(reddit.text)
reddit_df = pd.json_normalize(reddit_json, record_path = ["data", "children"])
reddit_df
| kind | data.approved_at_utc | data.subreddit | data.selftext | data.author_fullname | data.saved | data.mod_reason_title | data.gilded | data.clicked | data.title | ... | data.media.reddit_video.has_audio | data.media.reddit_video.height | data.media.reddit_video.width | data.media.reddit_video.scrubber_media_url | data.media.reddit_video.dash_url | data.media.reddit_video.duration | data.media.reddit_video.hls_url | data.media.reddit_video.is_gif | data.media.reddit_video.transcoding_status | data.link_flair_template_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t3 | None | rareinsults | t2_mi0px0qs | False | None | 0 | False | Get them off their high horse | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 1 | t3 | None | meirl | t2_lwkxry1qi | False | None | 0 | False | Meirl | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 2 | t3 | None | Eyebleach | t2_fr7ssjsq2 | False | None | 0 | False | Oh, hi | ... | True | 1280.0 | 720.0 | https://v.redd.it/1n6bxnv1ot0f1/DASH_96.mp4 | https://v.redd.it/1n6bxnv1ot0f1/DASHPlaylist.m... | 42.0 | https://v.redd.it/1n6bxnv1ot0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 3 | t3 | None | aww | t2_1hpvf3bcdh | False | None | 0 | False | The dog I walk is owned by a Hindu family. Eve... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 4 | t3 | None | law | t2_1gcbmqtjbj | False | None | 0 | False | Rep Ramirez : "We don't live in a dictatorship... | ... | True | 360.0 | 640.0 | https://v.redd.it/ni2yh3qwst0f1/DASH_96.mp4 | https://v.redd.it/ni2yh3qwst0f1/DASHPlaylist.m... | 108.0 | https://v.redd.it/ni2yh3qwst0f1/HLSPlaylist.m3... | False | completed | de297c36-dca0-11ee-9618-a26bc20401e6 | |
| 5 | t3 | None | pics | t2_1098fybz | False | None | 0 | False | Ben Cohen, co-founder of Ben and Jerry's, is d... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 6 | t3 | None | meirl | t2_1630bts5 | False | None | 0 | False | Meirl | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 7 | t3 | None | Fauxmoi | t2_80kmkxzb | False | None | 0 | False | Bruce Springsteen on the opening night of his ... | ... | True | 1280.0 | 720.0 | https://v.redd.it/ybmlsyzrhu0f1/DASH_96.mp4 | https://v.redd.it/ybmlsyzrhu0f1/DASHPlaylist.m... | 57.0 | https://v.redd.it/ybmlsyzrhu0f1/HLSPlaylist.m3... | False | completed | 0467d69a-d6d8-11ef-a9ba-16ae90dacafd | |
| 8 | t3 | None | Music | t2_nl1wwdd | False | None | 0 | False | Bruce Springsteen Calls Donald Trump ‘Incompet... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ac64945c-f5e0-11e4-9b0d-0e141393b029 | |
| 9 | t3 | None | nextfuckinglevel | t2_hc9ju | False | None | 0 | False | This lighted floor game where groups work toge... | ... | True | 638.0 | 360.0 | https://v.redd.it/a39eay44au0f1/DASH_96.mp4 | https://v.redd.it/a39eay44au0f1/DASHPlaylist.m... | 60.0 | https://v.redd.it/a39eay44au0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 10 | t3 | None | funny | t2_xag54 | False | None | 0 | False | Thank you and have a safe trip | ... | True | 1920.0 | 1080.0 | https://v.redd.it/puagbarttu0f1/DASH_96.mp4 | https://v.redd.it/puagbarttu0f1/DASHPlaylist.m... | 82.0 | https://v.redd.it/puagbarttu0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 11 | t3 | None | chaoticgood | t2_80kmkxzb | False | None | 0 | False | RFK Jr. jumps out of his fucking chair as prot... | ... | True | 1080.0 | 1920.0 | https://v.redd.it/5m4h0jmcau0f1/DASH_96.mp4 | https://v.redd.it/5m4h0jmcau0f1/DASHPlaylist.m... | 52.0 | https://v.redd.it/5m4h0jmcau0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 12 | t3 | None | mildlyinfuriating | t2_p8epndi5n | False | None | 0 | False | What does a screen protector have to do with AI? | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 13 | t3 | None | Damnthatsinteresting | t2_4us0lrdh | False | None | 0 | False | A world record set for the fastest Rubik’s cub... | ... | True | 854.0 | 480.0 | https://v.redd.it/a1yxyzl1ut0f1/DASH_96.mp4 | https://v.redd.it/a1yxyzl1ut0f1/DASHPlaylist.m... | 28.0 | https://v.redd.it/a1yxyzl1ut0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 14 | t3 | None | Damnthatsinteresting | t2_9wrvogll | False | None | 0 | False | No CGI. 300,000 People. Most extras used in a ... | ... | True | 480.0 | 636.0 | https://v.redd.it/lt9lll68ku0f1/DASH_96.mp4 | https://v.redd.it/lt9lll68ku0f1/DASHPlaylist.m... | 23.0 | https://v.redd.it/lt9lll68ku0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 15 | t3 | None | todayilearned | t2_ojnpnf61 | False | None | 0 | False | TIL HBO didn't submit Alfie Allen (Theon), Car... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 16 | t3 | None | interestingasfuck | t2_jma306v5 | False | None | 0 | False | This is why you should NEVER jump into water w... | ... | True | 1280.0 | 720.0 | https://v.redd.it/t1awqd0rxu0f1/DASH_96.mp4 | https://v.redd.it/t1awqd0rxu0f1/DASHPlaylist.m... | 107.0 | https://v.redd.it/t1awqd0rxu0f1/HLSPlaylist.m3... | False | completed | NaN | |
| 17 | t3 | None | BeAmazed | t2_1e3zrmxjdr | False | None | 0 | False | Get a vet in here cause that dog is Sick 🔥 | ... | True | 854.0 | 480.0 | https://v.redd.it/8kkoreimhx0f1/DASH_96.mp4 | https://v.redd.it/8kkoreimhx0f1/DASHPlaylist.m... | 45.0 | https://v.redd.it/8kkoreimhx0f1/HLSPlaylist.m3... | False | completed | d8125494-bc6a-11ef-9d38-9e070e1f4fe9 | |
| 18 | t3 | None | clevercomebacks | t2_1b45x8e6g0 | False | None | 0 | False | Perfect timing so! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 19 | t3 | None | MadeMeSmile | t2_4dxpzq1h | False | None | 0 | False | She invited her bus driver to her birthday party. | ... | True | 1920.0 | 1080.0 | https://v.redd.it/9luej4fbrx0f1/DASH_96.mp4 | https://v.redd.it/9luej4fbrx0f1/DASHPlaylist.m... | 11.0 | https://v.redd.it/9luej4fbrx0f1/HLSPlaylist.m3... | False | completed | beb6337c-b971-11ea-b373-0ee7186c1723 | |
| 20 | t3 | None | comics | t2_5dc66 | False | None | 0 | False | Hoof | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2e1f0fe2-f5fe-11e3-a4cf-12313b0e95bd | |
| 21 | t3 | None | labrador | My wife and I live in a major city and she wor... | t2_2xlqbi3t | False | None | 0 | False | My best boy prevented our house from being rob... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 540763a8-b4c5-11ef-869c-227bf37bc986 | 
| 22 | t3 | None | nottheonion | t2_3meiynf7 | False | None | 0 | False | Abortion Law Forces Doctors to Keep Pregnant B... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| 23 | t3 | None | SipsTea | t2_lwkxry1qi | False | None | 0 | False | Spitting facts though!! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 75494202-a1a4-11ec-8b4b-ae20b4018c06 | |
| 24 | t3 | None | meme | t2_1fmi25awxz | False | None | 0 | False | Huh | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 
25 rows × 129 columns
Please note: we added headers = {'User-agent': 'DS6001'} to the requests.get() function. When we use requests.get(), we are accessing an API: a web-based interface for transferring data, usually in the form of JSON files. We will cover APIs in the next module. Most APIs, including Reddit’s, include some security to keep any one user from overusing the API and causing it to crash. For Reddit, we only need to provide a unique “User-agent”, which is an ID that Reddit uses to keep track of how much we are using the API. If two people use the same User-agent, both uses are counted against the same limit, so it’s better to choose a User-agent that has a unique name. If this block of code results in an error for you (so that reddit.text displays as '{"message": "Too Many Requests", "error": 429}'), then change DS6001 to something else, and it should work.
To save the metadata in the data frame, use the meta parameter, set equal to a list of paths for each metadata feature you wish to store in the dataframe. In this case, to save the “kind” feature and the “after” feature within the “data” branch, I type meta = ['kind', ['data', 'after']]. Finally, because there is already a kind feature stored within the records, I must distinguish the metadata feature with a prefix. I use meta_prefix='meta' to place “meta” prior to the column names of the metadata.
Because metadata exist outside of the records, they will be constant across the rows in the dataframe. Here’s the code to convert the Reddit data to a data frame while storing the metadata. Take a look at the resulting dataframe and scroll all the way to the right to see the two metadata columns:
url = "http://www.reddit.com/r/popular/top.json"
reddit = requests.get(url, headers = {'User-agent': 'DS6001'})
reddit_json = json.loads(reddit.text)
reddit_df = pd.json_normalize(reddit_json, 
                              record_path = ["data", "children"], 
                              meta = ['kind', ['data', 'after']], 
                              meta_prefix = "meta")
reddit_df
| kind | data.approved_at_utc | data.subreddit | data.selftext | data.author_fullname | data.saved | data.mod_reason_title | data.gilded | data.clicked | data.title | ... | data.media.reddit_video.width | data.media.reddit_video.scrubber_media_url | data.media.reddit_video.dash_url | data.media.reddit_video.duration | data.media.reddit_video.hls_url | data.media.reddit_video.is_gif | data.media.reddit_video.transcoding_status | data.link_flair_template_id | metakind | metadata.after | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t3 | None | rareinsults | t2_mi0px0qs | False | None | 0 | False | Get them off their high horse | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 1 | t3 | None | meirl | t2_lwkxry1qi | False | None | 0 | False | Meirl | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 2 | t3 | None | Eyebleach | t2_fr7ssjsq2 | False | None | 0 | False | Oh, hi | ... | 720.0 | https://v.redd.it/1n6bxnv1ot0f1/DASH_96.mp4 | https://v.redd.it/1n6bxnv1ot0f1/DASHPlaylist.m... | 42.0 | https://v.redd.it/1n6bxnv1ot0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 3 | t3 | None | aww | t2_1hpvf3bcdh | False | None | 0 | False | The dog I walk is owned by a Hindu family. Eve... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 4 | t3 | None | law | t2_1gcbmqtjbj | False | None | 0 | False | Rep Ramirez : "We don't live in a dictatorship... | ... | 640.0 | https://v.redd.it/ni2yh3qwst0f1/DASH_96.mp4 | https://v.redd.it/ni2yh3qwst0f1/DASHPlaylist.m... | 108.0 | https://v.redd.it/ni2yh3qwst0f1/HLSPlaylist.m3... | False | completed | de297c36-dca0-11ee-9618-a26bc20401e6 | Listing | t3_1kn3itk | |
| 5 | t3 | None | pics | t2_1098fybz | False | None | 0 | False | Ben Cohen, co-founder of Ben and Jerry's, is d... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 6 | t3 | None | meirl | t2_1630bts5 | False | None | 0 | False | Meirl | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 7 | t3 | None | Fauxmoi | t2_80kmkxzb | False | None | 0 | False | Bruce Springsteen on the opening night of his ... | ... | 720.0 | https://v.redd.it/ybmlsyzrhu0f1/DASH_96.mp4 | https://v.redd.it/ybmlsyzrhu0f1/DASHPlaylist.m... | 57.0 | https://v.redd.it/ybmlsyzrhu0f1/HLSPlaylist.m3... | False | completed | 0467d69a-d6d8-11ef-a9ba-16ae90dacafd | Listing | t3_1kn3itk | |
| 8 | t3 | None | Music | t2_nl1wwdd | False | None | 0 | False | Bruce Springsteen Calls Donald Trump ‘Incompet... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ac64945c-f5e0-11e4-9b0d-0e141393b029 | Listing | t3_1kn3itk | |
| 9 | t3 | None | nextfuckinglevel | t2_hc9ju | False | None | 0 | False | This lighted floor game where groups work toge... | ... | 360.0 | https://v.redd.it/a39eay44au0f1/DASH_96.mp4 | https://v.redd.it/a39eay44au0f1/DASHPlaylist.m... | 60.0 | https://v.redd.it/a39eay44au0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 10 | t3 | None | funny | t2_xag54 | False | None | 0 | False | Thank you and have a safe trip | ... | 1080.0 | https://v.redd.it/puagbarttu0f1/DASH_96.mp4 | https://v.redd.it/puagbarttu0f1/DASHPlaylist.m... | 82.0 | https://v.redd.it/puagbarttu0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 11 | t3 | None | chaoticgood | t2_80kmkxzb | False | None | 0 | False | RFK Jr. jumps out of his fucking chair as prot... | ... | 1920.0 | https://v.redd.it/5m4h0jmcau0f1/DASH_96.mp4 | https://v.redd.it/5m4h0jmcau0f1/DASHPlaylist.m... | 52.0 | https://v.redd.it/5m4h0jmcau0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 12 | t3 | None | mildlyinfuriating | t2_p8epndi5n | False | None | 0 | False | What does a screen protector have to do with AI? | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 13 | t3 | None | Damnthatsinteresting | t2_4us0lrdh | False | None | 0 | False | A world record set for the fastest Rubik’s cub... | ... | 480.0 | https://v.redd.it/a1yxyzl1ut0f1/DASH_96.mp4 | https://v.redd.it/a1yxyzl1ut0f1/DASHPlaylist.m... | 28.0 | https://v.redd.it/a1yxyzl1ut0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 14 | t3 | None | Damnthatsinteresting | t2_9wrvogll | False | None | 0 | False | No CGI. 300,000 People. Most extras used in a ... | ... | 636.0 | https://v.redd.it/lt9lll68ku0f1/DASH_96.mp4 | https://v.redd.it/lt9lll68ku0f1/DASHPlaylist.m... | 23.0 | https://v.redd.it/lt9lll68ku0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 15 | t3 | None | todayilearned | t2_ojnpnf61 | False | None | 0 | False | TIL HBO didn't submit Alfie Allen (Theon), Car... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 16 | t3 | None | interestingasfuck | t2_jma306v5 | False | None | 0 | False | This is why you should NEVER jump into water w... | ... | 720.0 | https://v.redd.it/t1awqd0rxu0f1/DASH_96.mp4 | https://v.redd.it/t1awqd0rxu0f1/DASHPlaylist.m... | 107.0 | https://v.redd.it/t1awqd0rxu0f1/HLSPlaylist.m3... | False | completed | NaN | Listing | t3_1kn3itk | |
| 17 | t3 | None | BeAmazed | t2_1e3zrmxjdr | False | None | 0 | False | Get a vet in here cause that dog is Sick 🔥 | ... | 480.0 | https://v.redd.it/8kkoreimhx0f1/DASH_96.mp4 | https://v.redd.it/8kkoreimhx0f1/DASHPlaylist.m... | 45.0 | https://v.redd.it/8kkoreimhx0f1/HLSPlaylist.m3... | False | completed | d8125494-bc6a-11ef-9d38-9e070e1f4fe9 | Listing | t3_1kn3itk | |
| 18 | t3 | None | clevercomebacks | t2_1b45x8e6g0 | False | None | 0 | False | Perfect timing so! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 19 | t3 | None | MadeMeSmile | t2_4dxpzq1h | False | None | 0 | False | She invited her bus driver to her birthday party. | ... | 1080.0 | https://v.redd.it/9luej4fbrx0f1/DASH_96.mp4 | https://v.redd.it/9luej4fbrx0f1/DASHPlaylist.m... | 11.0 | https://v.redd.it/9luej4fbrx0f1/HLSPlaylist.m3... | False | completed | beb6337c-b971-11ea-b373-0ee7186c1723 | Listing | t3_1kn3itk | |
| 20 | t3 | None | comics | t2_5dc66 | False | None | 0 | False | Hoof | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2e1f0fe2-f5fe-11e3-a4cf-12313b0e95bd | Listing | t3_1kn3itk | |
| 21 | t3 | None | labrador | My wife and I live in a major city and she wor... | t2_2xlqbi3t | False | None | 0 | False | My best boy prevented our house from being rob... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 540763a8-b4c5-11ef-869c-227bf37bc986 | Listing | t3_1kn3itk | 
| 22 | t3 | None | nottheonion | t2_3meiynf7 | False | None | 0 | False | Abortion Law Forces Doctors to Keep Pregnant B... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | |
| 23 | t3 | None | SipsTea | t2_lwkxry1qi | False | None | 0 | False | Spitting facts though!! | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 75494202-a1a4-11ec-8b4b-ae20b4018c06 | Listing | t3_1kn3itk | |
| 24 | t3 | None | meme | t2_1fmi25awxz | False | None | 0 | False | Huh | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Listing | t3_1kn3itk | 
25 rows × 131 columns
3.5. Saving JSON Files and Converting Data Frames to JSON#
It is possible to save a JSON file to your local disk space, or to convert a dataframe to JSON and, if you want to, save that JSON to disk.
3.5.1. Saving Existing JSON Files to Disk#
Once we’ve used the json.loads() function to register data as JSON in Python, we can save that JSON to our local disk space by using open() to create a new file, and json.dump() (note: not json.dumps()) to save the JSON to that file.
For example, we registered the customer data as JSON in Python with the following code:
users = requests.get("https://jsonplaceholder.typicode.com/users")
users_json = json.loads(users.text)
To save this JSON file to my harddrive, I first use os.chdir() to set the working directory to the folder where I want to save my JSON file. (I omit that code here because it won’t work on other people’s computers.) Then I type:
with open('users.json', 'w') as outfile:
     json.dump(users_json, outfile, sort_keys = True, indent = 4,
               ensure_ascii = False)
It is also possible to use a file extension such as .txt instead of .json to save the JSON formatted data in a plain text file.
3.5.2. Converting Tabular DataFrames to JSON#
Any dataframe can be turned into a JSON file by applying the .to_json() method to the dataframe that is saved in Python’s memory. The trick is specifying a good organization for this file. Recall that JSON structures are much more flexible than dataframes. It is not possible to go from a dataframe to a nested, complicated JSON structure because the information about nesting simply does not exist for dataframes. That said, there are several choices for organizing the data:
- orient="records"works with JSON files organized as a list-of-sets, where each set is an entire record (or a row in flat data):
users_df = users_df.loc[0:2,] # just keeping the first 3 records, for display purposes
new_json = users_df.to_json(orient="records")
json.loads(new_json)[0]
{'id': 1,
 'name': 'Leanne Graham',
 'username': 'Bret',
 'email': 'Sincere@april.biz',
 'phone': '1-770-736-8031 x56442',
 'website': 'hildegard.org',
 'address.street': 'Kulas Light',
 'address.suite': 'Apt. 556',
 'address.city': 'Gwenborough',
 'address.zipcode': '92998-3874',
 'address.geo.lat': '-37.3159',
 'address.geo.lng': '81.1496',
 'company.name': 'Romaguera-Crona',
 'company.catchPhrase': 'Multi-layered client-server neural-net',
 'company.bs': 'harness real-time e-markets'}
- orient="columns"works with JSON files organized as a list-of-dictionaries, where each dictionary is an entire column (the names are the row-names in the tabular data)
new_json = users_df.to_json(orient="columns")
json.loads(new_json)
{'id': {'0': 1, '1': 2, '2': 3},
 'name': {'0': 'Leanne Graham', '1': 'Ervin Howell', '2': 'Clementine Bauch'},
 'username': {'0': 'Bret', '1': 'Antonette', '2': 'Samantha'},
 'email': {'0': 'Sincere@april.biz',
  '1': 'Shanna@melissa.tv',
  '2': 'Nathan@yesenia.net'},
 'phone': {'0': '1-770-736-8031 x56442',
  '1': '010-692-6593 x09125',
  '2': '1-463-123-4447'},
 'website': {'0': 'hildegard.org', '1': 'anastasia.net', '2': 'ramiro.info'},
 'address.street': {'0': 'Kulas Light',
  '1': 'Victor Plains',
  '2': 'Douglas Extension'},
 'address.suite': {'0': 'Apt. 556', '1': 'Suite 879', '2': 'Suite 847'},
 'address.city': {'0': 'Gwenborough',
  '1': 'Wisokyburgh',
  '2': 'McKenziehaven'},
 'address.zipcode': {'0': '92998-3874', '1': '90566-7771', '2': '59590-4157'},
 'address.geo.lat': {'0': '-37.3159', '1': '-43.9509', '2': '-68.6102'},
 'address.geo.lng': {'0': '81.1496', '1': '-34.4618', '2': '-47.0653'},
 'company.name': {'0': 'Romaguera-Crona',
  '1': 'Deckow-Crist',
  '2': 'Romaguera-Jacobson'},
 'company.catchPhrase': {'0': 'Multi-layered client-server neural-net',
  '1': 'Proactive didactic contingency',
  '2': 'Face to face bifurcated interface'},
 'company.bs': {'0': 'harness real-time e-markets',
  '1': 'synergize scalable supply-chains',
  '2': 'e-enable strategic applications'}}
- orient="split"works with JSON files organized as dictionary with three lists:- columnslists the column names,- indexlists the row names, and- datais a list-of-lists of data points, one list for each row.
new_json = users_df.to_json(orient="split")
json.loads(new_json)
{'columns': ['id',
  'name',
  'username',
  'email',
  'phone',
  'website',
  'address.street',
  'address.suite',
  'address.city',
  'address.zipcode',
  'address.geo.lat',
  'address.geo.lng',
  'company.name',
  'company.catchPhrase',
  'company.bs'],
 'index': [0, 1, 2],
 'data': [[1,
   'Leanne Graham',
   'Bret',
   'Sincere@april.biz',
   '1-770-736-8031 x56442',
   'hildegard.org',
   'Kulas Light',
   'Apt. 556',
   'Gwenborough',
   '92998-3874',
   '-37.3159',
   '81.1496',
   'Romaguera-Crona',
   'Multi-layered client-server neural-net',
   'harness real-time e-markets'],
  [2,
   'Ervin Howell',
   'Antonette',
   'Shanna@melissa.tv',
   '010-692-6593 x09125',
   'anastasia.net',
   'Victor Plains',
   'Suite 879',
   'Wisokyburgh',
   '90566-7771',
   '-43.9509',
   '-34.4618',
   'Deckow-Crist',
   'Proactive didactic contingency',
   'synergize scalable supply-chains'],
  [3,
   'Clementine Bauch',
   'Samantha',
   'Nathan@yesenia.net',
   '1-463-123-4447',
   'ramiro.info',
   'Douglas Extension',
   'Suite 847',
   'McKenziehaven',
   '59590-4157',
   '-68.6102',
   '-47.0653',
   'Romaguera-Jacobson',
   'Face to face bifurcated interface',
   'e-enable strategic applications']]}
- orient="index"is like- orient="records"but includes the name of each row in the data:
new_json = users_df.to_json(orient="index")
json.loads(new_json)
{'0': {'id': 1,
  'name': 'Leanne Graham',
  'username': 'Bret',
  'email': 'Sincere@april.biz',
  'phone': '1-770-736-8031 x56442',
  'website': 'hildegard.org',
  'address.street': 'Kulas Light',
  'address.suite': 'Apt. 556',
  'address.city': 'Gwenborough',
  'address.zipcode': '92998-3874',
  'address.geo.lat': '-37.3159',
  'address.geo.lng': '81.1496',
  'company.name': 'Romaguera-Crona',
  'company.catchPhrase': 'Multi-layered client-server neural-net',
  'company.bs': 'harness real-time e-markets'},
 '1': {'id': 2,
  'name': 'Ervin Howell',
  'username': 'Antonette',
  'email': 'Shanna@melissa.tv',
  'phone': '010-692-6593 x09125',
  'website': 'anastasia.net',
  'address.street': 'Victor Plains',
  'address.suite': 'Suite 879',
  'address.city': 'Wisokyburgh',
  'address.zipcode': '90566-7771',
  'address.geo.lat': '-43.9509',
  'address.geo.lng': '-34.4618',
  'company.name': 'Deckow-Crist',
  'company.catchPhrase': 'Proactive didactic contingency',
  'company.bs': 'synergize scalable supply-chains'},
 '2': {'id': 3,
  'name': 'Clementine Bauch',
  'username': 'Samantha',
  'email': 'Nathan@yesenia.net',
  'phone': '1-463-123-4447',
  'website': 'ramiro.info',
  'address.street': 'Douglas Extension',
  'address.suite': 'Suite 847',
  'address.city': 'McKenziehaven',
  'address.zipcode': '59590-4157',
  'address.geo.lat': '-68.6102',
  'address.geo.lng': '-47.0653',
  'company.name': 'Romaguera-Jacobson',
  'company.catchPhrase': 'Face to face bifurcated interface',
  'company.bs': 'e-enable strategic applications'}}
- orient="values"only contains the datapoints:
new_json = users_df.to_json(orient="values")
json.loads(new_json)
[[1,
  'Leanne Graham',
  'Bret',
  'Sincere@april.biz',
  '1-770-736-8031 x56442',
  'hildegard.org',
  'Kulas Light',
  'Apt. 556',
  'Gwenborough',
  '92998-3874',
  '-37.3159',
  '81.1496',
  'Romaguera-Crona',
  'Multi-layered client-server neural-net',
  'harness real-time e-markets'],
 [2,
  'Ervin Howell',
  'Antonette',
  'Shanna@melissa.tv',
  '010-692-6593 x09125',
  'anastasia.net',
  'Victor Plains',
  'Suite 879',
  'Wisokyburgh',
  '90566-7771',
  '-43.9509',
  '-34.4618',
  'Deckow-Crist',
  'Proactive didactic contingency',
  'synergize scalable supply-chains'],
 [3,
  'Clementine Bauch',
  'Samantha',
  'Nathan@yesenia.net',
  '1-463-123-4447',
  'ramiro.info',
  'Douglas Extension',
  'Suite 847',
  'McKenziehaven',
  '59590-4157',
  '-68.6102',
  '-47.0653',
  'Romaguera-Jacobson',
  'Face to face bifurcated interface',
  'e-enable strategic applications']]
To save these files to disk, specify a filename (using the os.chdir() function to change the working directory to the folder in which you save to save this file), or a filename and path, for the first parameter:
users_df.to_json("myjson.json", orient="values")
