#it'd sure be nice if tumblr suggested the most recently used tags
div-divington · 11 months ago
Text
[five images]
Federal Bureau of Control
--> the Research Sector
5 notes
blech · 6 months ago
Text
tumblr-backup and datasette
I've been using tumblr_backup, a script that replicates the old Tumblr backup format, for a while. I use it both to back up my main blog and the likes I've accumulated; it turns out the likes outnumber my posts by more than two to one.
Sadly, there isn't an 'archive' view of likes, so I have no idea what's in there from way back in 2010, when I first used Tumblr heavily. Heck, even getting back to 2021 is hard. Pulling the data down to manipulate locally seems wise.
I was never quite sure it'd backed up all of my likes, and it turns out that a change to the API was in fact limiting it to the most recent 1,000 entries. Luckily, someone else noticed this well before I did, and a new version, tumblr-backup, not only exists, but is a Python package, which made it easy to install and run. (You do need an API key.)
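Getting set up is quick. A minimal sketch, assuming the PyPI package name and the --set-api-key flag from the project's README as I read it (verify against the current docs):
pip install tumblr-backup
# register an app at https://www.tumblr.com/oauth/apps and copy its consumer ("API") key
tumblr-backup --set-api-key YOUR_CONSUMER_KEY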
I ran it using this invocation, which saved the JSON source of each post (-j), saved likes (-l), didn't download images (-k), skipped the first 1,000 entries (-s 1000), and output to the directory 'likes/full' (-O):
tumblr-backup -j -k -l -s 1000 blech -O likes/full 
This gave me over 12,000 files in likes/full/json, one per like. This is great, but a database is nice for querying. Luckily, jq exists:
jq -s 'map(.)' likes/full/json/*.json > likes/full/likes.json
This slurps (-s) every JSON file into a single array (the map(.) is just an identity pass over that array) and saves it to a new JSON file, likes.json. I then did a follow-up to convert it into the newline-delimited form that sqlite-utils ingests:
jq -c '.[]' likes/full/likes.json > likes/full/likes-nl.json
A smart reader can probably combine those into a single invocation; something like the sketch below.
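Since jq with -c prints each input value compactly on its own line, feeding it the per-like files directly should produce the newline-delimited file in one step, skipping the intermediate array (untested, but plain jq behaviour):
jq -c '.' likes/full/json/*.json > likes/full/likes-nl.json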
Using Simon Willison's sqlite-utils package, I could then load all of them into a database (with --alter because the keys of each JSON file vary, so the initial column setup is incomplete):
sqlite-utils insert likes/full/likes.db lines likes/full/likes-nl.json --nl --alter
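To sanity-check the import, sqlite-utils can count rows per table (the 'lines' table name here matches the insert command above):
sqlite-utils tables likes/full/likes.db --counts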
This can then be fed into Willison's Datasette for a nice web UI to query it (this serves the database at http://localhost:8002):
datasette serve --port 8002 likes/full/likes.db
There are a lot of columns cluttering up the view, so I'd suggest this subset; it also sorts the post with the most notes (likes, reblogs, and comments combined) to the top:
select rowid, id, short_url, slug, blog_name, date, timestamp, liked_timestamp,
       caption, format, note_count, state, summary, tags, type
from lines
order by note_count desc
limit 101
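And since the original itch was not knowing what I liked back in 2010, a per-year tally is a natural follow-up. A sketch, assuming liked_timestamp is a Unix epoch as the Tumblr API returns it:
select strftime('%Y', datetime(liked_timestamp, 'unixepoch')) as year, count(*) as likes
from lines
group by year
order by year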
Happy excavating!
2 notes