Load Parquet Data from S3 to Pandas DataFrame
Question
I have a parquet dataset stored in AWS S3 and want to access it in a Metaflow flow. How can I read one or several Parquet files at once from a flow and use them in a pandas DataFrame?
Solution
1. Access Parquet Data in S3
You can access a Parquet dataset on S3 from a Metaflow flow with the metaflow.S3 client and load it into a pandas DataFrame for analysis.
To access a single file, use metaflow.S3.get. Parquet datasets are often split across many files, which is a good use case for Metaflow's s3.get_many function.
This example uses Parquet data stored in S3 from Ookla Global's AWS Open Data Submission.
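For the single-file case, a minimal sketch might look like the following. The helper names quarter_key and load_one are hypothetical, and the key pattern for quarters 2 to 4 is an assumption extrapolated from the quarter 1 file names used in this article. The sketch also assumes that metaflow, pandas, and a Parquet engine such as pyarrow are installed and that AWS access is configured.

```python
# Sketch: read one Parquet object from S3 into pandas via metaflow.S3.get.
BASE_URL = "s3://ookla-open-data/parquet/performance/type=fixed/"


def quarter_key(year: str, quarter: int = 1) -> str:
    """Hypothetical helper: build the object key for one quarter.

    Assumes every quarter follows the naming pattern of the quarter 1
    files used in this article (file dated the first day of the quarter).
    """
    month = 3 * (quarter - 1) + 1  # quarters start in months 1, 4, 7, 10
    return (
        f"year={year}/quarter={quarter}/"
        f"{year}-{month:02d}-01_performance_fixed_tiles.parquet"
    )


def load_one(key: str, s3root: str = BASE_URL):
    """Download a single object and return it as a pandas DataFrame."""
    # Imported lazily so quarter_key works without these packages.
    import pandas as pd
    from metaflow import S3

    with S3(s3root=s3root) as s3:
        obj = s3.get(key)
        # Read inside the with block: the temporary local file
        # is removed when the block exits.
        return pd.read_parquet(obj.path)
```

Calling load_one(quarter_key("2019")) would then download the 2019 first-quarter file and return it as a DataFrame.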
2. Run Flow
This flow shows how to:
- Download multiple Parquet files using Metaflow's s3.get_many function.
- Read the file for the first year (2019, quarter 1) into a pandas DataFrame.
load_parquet_to_pandas.py
from metaflow import FlowSpec, step, S3

BASE_URL = 's3://ookla-open-data/' + \
    'parquet/performance/type=fixed/'
YEARS = ['2019', '2020', '2021', '2022']
# One object key per year: the first-quarter file of each year.
S3_PATHS = [
    f'year={y}/quarter=1/{y}-' + \
    '01-01_performance_fixed_tiles.parquet'
    for y in YEARS
]


class ParquetPandasFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.load_parquet)

    @step
    def load_parquet(self):
        import pandas as pd
        with S3(s3root=BASE_URL) as s3:
            # Download the requested files to temporary local paths.
            tmp_data_path = s3.get_many(S3_PATHS)
            # Read inside the with block: the temporary files
            # are removed when the block exits.
            first_path = tmp_data_path[0].path
            self.df = pd.read_parquet(first_path)
        self.next(self.end)

    @step
    def end(self):
        print('DataFrame for first year ' + \
              f'has shape {self.df.shape}.')


if __name__ == '__main__':
    ParquetPandasFlow()
python load_parquet_to_pandas.py run
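The flow above downloads four files but reads only the first. If all of them should end up in one DataFrame, a sketch along the following lines could be called inside the with S3(...) block in place of the single read_parquet call. The helper names year_from_key and concat_parquet are hypothetical, and the combined frame is assumed to fit in memory, which may not hold for this dataset on a small machine.

```python
import re


def year_from_key(key: str) -> str:
    """Extract the value of the year= partition from an object key."""
    match = re.search(r"year=(\d{4})", key)
    if match is None:
        raise ValueError(f"no year= partition in {key!r}")
    return match.group(1)


def concat_parquet(s3_objects):
    """Read every downloaded object and stack the rows into one DataFrame.

    s3_objects is the list returned by s3.get_many. Call this while the
    with S3(...) block is still open, because the temporary files are
    removed when it exits.
    """
    import pandas as pd  # lazy import; needs a Parquet engine such as pyarrow

    frames = [
        # Tag each row with the year its file came from.
        pd.read_parquet(obj.path).assign(year=year_from_key(obj.key))
        for obj in s3_objects
    ]
    return pd.concat(frames, ignore_index=True)
```

Inside the load_parquet step this would be used as self.df = concat_parquet(s3.get_many(S3_PATHS)).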
3. Access Artifacts Outside of Flow
The following can be run in any script or notebook to access the contents of the DataFrame that was stored as a flow artifact with self.df.
from metaflow import Flow
Flow('ParquetPandasFlow').latest_run.data.df.head()
 | quadkey | tile | avg_d_kbps | avg_u_kbps | avg_lat_ms | tests | devices |
---|---|---|---|---|---|---|---|
0 | 0231113112003202 | POLYGON((-90.6591796875 38.4922941923613, -90.... | 66216 | 12490 | 13 | 28 | 4 |
1 | 1322111021111001 | POLYGON((110.352172851562 21.2893743558604, 11... | 102598 | 37356 | 13 | 15 | 4 |
2 | 3112203030003110 | POLYGON((138.592529296875 -34.9219710361638, 1... | 24686 | 18736 | 18 | 162 | 106 |
3 | 0320000130321312 | POLYGON((-87.637939453125 40.225024210605, -87... | 17674 | 13989 | 78 | 364 | 4 |
4 | 0320001332313103 | POLYGON((-84.7430419921875 38.9209554204673, -... | 441192 | 218955 | 22 | 14 | 1 |
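Note that latest_run returns the most recent run even if it failed or is still in progress, in which case the df artifact may not exist. A slightly more defensive sketch uses the Client API's latest_successful_run instead. The wrapper name load_latest_df is hypothetical.

```python
def load_latest_df(flow_name: str = "ParquetPandasFlow"):
    """Return the df artifact of the most recent successful run."""
    from metaflow import Flow  # lazy import; requires metaflow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        # No run of this flow has completed successfully yet.
        raise RuntimeError(f"{flow_name} has no successful run yet")
    return run.data.df
```

In a notebook, load_latest_df().head() would then show the same preview as above.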
Further Reading
- Working with cloud data in Metaflow
- Scaling Metaflow flows