8. The Metropolitan Museum of Art Collection (over 450,000 works of art)
https://www.metmuseum.org/art/the-collection
9. Which tool should we use to work with the MET dataset?
▶ Dataset on Kaggle is 36GB
▶ A typical laptop has 8GB of RAM
▶ Rule of thumb: pandas needs 5-10X as much RAM as the dataset size
we ❤️ pandas, but what do we do when it breaks?
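The arithmetic behind these bullets can be sketched in a few lines. The 36GB, 8GB, and 5-10X figures come from the slide; the function name is hypothetical and chosen only for illustration.

```python
def pandas_ram_needed_gb(dataset_gb, low_factor=5, high_factor=10):
    """Return the (low, high) RAM estimate in GB for a dataset of the given size."""
    return dataset_gb * low_factor, dataset_gb * high_factor


DATASET_GB = 36    # size of the MET dataset on Kaggle (from the slide)
LAPTOP_RAM_GB = 8  # RAM of a typical laptop (from the slide)

low, high = pandas_ram_needed_gb(DATASET_GB)
print(f"Estimated RAM needed: {low}-{high} GB; available: {LAPTOP_RAM_GB} GB")
print(f"Even the low estimate is {low / LAPTOP_RAM_GB:.1f}x the laptop's RAM")
```

Under this rule of thumb the dataset would need roughly 180 to 360 GB of RAM, far beyond what the laptop has.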
10. Which tool should we use to work with the MET dataset?
▶ Database engines are built to handle data of any size
▶ Many data scientists use Python
▶ Ibis generates SQL and the database engine (backend) does the heavy lifting
generate SQL and let the engine do the lifting 🦾
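A minimal sketch of "let the engine do the lifting", using Python's built-in sqlite3 module as a stand-in backend. The table name, column names, and rows are made up for illustration and are not the real MET data; the SQL is written by hand here rather than generated by Ibis.

```python
import sqlite3

# Hypothetical rows standing in for the MET collection; not real data.
rows = [
    (1, "Paintings"),
    (2, "Paintings"),
    (3, "Arms and Armor"),
    (4, "Photographs"),
    (5, "Paintings"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (object_id INTEGER, department TEXT)")
con.executemany("INSERT INTO objects VALUES (?, ?)", rows)

# The engine scans and aggregates the table; Python only receives the summary.
query = """
    SELECT department, COUNT(*) AS n
    FROM objects
    GROUP BY department
    ORDER BY n DESC, department
"""
summary = con.execute(query).fetchall()
print(summary)
con.close()
```

The point of the design is that only the small aggregated result crosses into Python, so the size of the underlying table does not have to fit in the client's memory.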
11. Ibis
- pandas-like
- compiles to SQL (executed efficiently)
* anything you can write with a SELECT statement you can write in Ibis
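To make "pandas-like, compiles to SQL" concrete, here is a toy sketch of the idea: chained method calls build up a description of a query, and SQL text is only produced at the end. The `Table` class below is hypothetical and is not Ibis's actual API; the table and column names are assumptions as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Toy stand-in for a table expression; NOT Ibis's real API."""
    name: str
    columns: tuple = ("*",)
    predicates: tuple = ()

    def select(self, *columns):
        # Return a new expression with the chosen columns; nothing runs yet.
        return Table(self.name, columns, self.predicates)

    def filter(self, predicate):
        # Return a new expression with one more WHERE condition.
        return Table(self.name, self.columns, self.predicates + (predicate,))

    def to_sql(self):
        # Compile the accumulated description into a SELECT statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql


objects = Table("objects")  # hypothetical table name
expr = objects.filter("department = 'Paintings'").select("object_id", "title")
print(expr.to_sql())
```

A real library does far more (typed columns, joins, aggregations, multiple SQL dialects), but the shape is the same: Python expressions in, a SELECT statement out, with execution left to the backend.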
12. The MET on Google BigQuery
▶ 200,000 art pieces
▶ Data hosted in the cloud publicly
▶ Data is consistently updated
▶ Uses SQL to query