{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 💾 Save to Parquet\n", "\n", "The Parquet format is an efficient file format to save Synoptic's data.\n", "If I need to reuse data over many times (i.e., researching a case study) then I don't want to keep asking Synoptic for the data; I want to get the data and save it to local disk. Also, Synoptic restricts how much data you can retrieve in a single API request. If you need long a long time series then you will need to make multiple API calls. You should save the DataFrame information to a Parquet file to save disk space and most performant loading time.\n", "\n", "**_What are the benefits of saving Synoptic's data to Parquet instead of the raw JSON?_**\n", "\n", "To demonstrate the benefits of Parquet, let's collect a timeseries of 5 days of data for all the stations within 10 miles of WBB.\n", "\n", "1. Write the raw JSON to a JSON file.\n", "1. Write the Polars DataFrame to a Parquet file.\n", "\n", "- How large is the JSON file versus Parquet file? _Parquet is about 18x smaller than JSON, because it is efficiently compressed._\n", "- How long does it take to load a JSON file versus Parquet file? _Parquet is faster to load into memory, it's already organized in a clean table, you can read only select rows if you want, and it's easy to read multiple files at a time._\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from datetime import timedelta\n", "\n", "import synoptic\n", "import polars as pl" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🚚💨 Speedy delivery from Synoptic timeseries service.\n", "📦 Received data from 91 stations.\n", "Number of rows: 1,425,573\n" ] } ], "source": [ "s = synoptic.TimeSeries(radius=\"wbb,10\", recent=timedelta(days=5))\n", "print(f\"Number of rows: {len(s.df()):,}\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JSON file size: 43.81 MB\n", " Parquet size: 2.29 MB\n" ] } ], "source": [ "import json\n", "from pathlib import Path\n", "\n", "filepath = Path(\"sample_timeseries.json\")\n", "parquet = filepath.with_suffix(\".parquet\")\n", "\n", "# Write raw data to JSON\n", "with open(filepath, \"w\") as f:\n", " json.dump(s.json, f, indent=4)\n", "\n", "\n", "# Write DataFrame to Parquet\n", "s.df().write_parquet(parquet)\n", "\n", "print(f\"JSON file size: {filepath.stat().st_size / 1000 / 1000:>5.2f} MB\")\n", "print(f\" Parquet size: {parquet.stat().st_size / 1000 / 1000:>5.2f} MB\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, for 1.4 million observations Parquet is more than 19x smaller than the raw JSON. That's impressive.\n", "\n", "Reading Parquet is also fast, plus it's already in a DataFrame and we don't need to parse the JSON again." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 357 ms, sys: 200 ms, total: 557 ms\n", "Wall time: 551 ms\n" ] } ], "source": [ "%%time\n", "# Read the JSON file\n", "with open(filepath, \"r\") as json_file:\n", " data = json.load(json_file)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 543 ms, sys: 531 ms, total: 1.07 s\n", "Wall time: 303 ms\n" ] } ], "source": [ "%%time\n", "_ = pl.read_parquet(parquet)" ] } ], "metadata": { "kernelspec": { "display_name": "synoptic2", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }