From fdf7c1fb1ada2a87ecf65abefb0cbf7611587bda Mon Sep 17 00:00:00 2001 From: jbkela <299201617+jbkela@users.noreply.github.com> Date: Thu, 2 Jul 2026 20:36:14 +0000 Subject: [PATCH] Python solution for the artworks carousel challenge --- README.md | 195 ++++++++- carousel_parser.py | 245 +++++++++++ conftest.py | 6 + instructions/README.md | 28 ++ pytest.ini | 3 + requirements.txt | 3 + test_parser.py | 195 +++++++++ tests/fixtures/monet_carousel.expected.json | 41 ++ tests/fixtures/monet_carousel.html | 71 +++ .../fixtures/monet_paintings_fr.expected.json | 384 +++++++++++++++++ tests/fixtures/monet_paintings_fr.html | 80 ++++ .../fixtures/picasso_paintings.expected.json | 349 +++++++++++++++ tests/fixtures/picasso_paintings.html | 78 ++++ tests/fixtures/power_cast.expected.json | 403 ++++++++++++++++++ tests/fixtures/power_cast.html | 51 +++ tests/fixtures/star_wars_movies.expected.json | 36 ++ tests/fixtures/star_wars_movies.html | 47 ++ .../vangogh_aria_layout.expected.json | 30 ++ tests/fixtures/vangogh_aria_layout.html | 45 ++ 19 files changed, 2273 insertions(+), 17 deletions(-) create mode 100644 carousel_parser.py create mode 100644 conftest.py create mode 100644 instructions/README.md create mode 100644 pytest.ini create mode 100644 requirements.txt create mode 100644 test_parser.py create mode 100644 tests/fixtures/monet_carousel.expected.json create mode 100644 tests/fixtures/monet_carousel.html create mode 100644 tests/fixtures/monet_paintings_fr.expected.json create mode 100644 tests/fixtures/monet_paintings_fr.html create mode 100644 tests/fixtures/picasso_paintings.expected.json create mode 100644 tests/fixtures/picasso_paintings.html create mode 100644 tests/fixtures/power_cast.expected.json create mode 100644 tests/fixtures/power_cast.html create mode 100644 tests/fixtures/star_wars_movies.expected.json create mode 100644 tests/fixtures/star_wars_movies.html create mode 100644 tests/fixtures/vangogh_aria_layout.expected.json create mode 100644 
tests/fixtures/vangogh_aria_layout.html diff --git a/README.md b/README.md index 4d5a093f..e92e448e 100644 --- a/README.md +++ b/README.md @@ -1,28 +1,189 @@ -# Extract Van Gogh Paintings Code Challenge +# Google artworks carousel extractor -Goal is to extract a list of Van Gogh paintings from the attached Google search results page. +Small Python script that pulls the artworks carousel (the row of paintings, or +people, that Google shows at the top of some searches) out of a saved results +page and gives you back an array. It only reads the local HTML file, it doesn't +make any requests. - +Output looks like this: -## Instructions +```json +{ "artworks": [ { "name": "The Starry Night", "extensions": ["1889"], "link": "https://www.google.com/search?...", "image": "data:image/jpeg;base64,..." } ] } +``` -This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].) -Try to come up with your own solution and your own test. -Extract the painting `name`, `extensions` array (date), and Google `link` in an array. +## Install -Fork this repository and make a PR when ready. +```bash +pip install -r requirements.txt +``` -Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like. +You really only need beautifulsoup4. It'll use lxml if you've got it (bit +faster), and if not it just falls back to Python's built-in html.parser. pytest +is only for the tests. -Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything. 
+## Run -[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb -[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json -[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html -[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json +```bash +python carousel_parser.py files/van-gogh-paintings.html +``` -Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed). +That prints the `{"artworks": [...]}` JSON to the terminal. -Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.) +## Test -The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want. +```bash +pytest +``` + +## What the task wanted + +Just so I didn't miss anything, here's what the +[instructions](instructions/README.md) asked for and what I did: + +- Get the `name`, the `extensions` (the date), and the Google `link` for each one. +- Also grab the thumbnails that are actually in the page, and skip the ones that + would need another request. Those come out as `null`. +- Read the HTML file directly, no extra HTTP calls. +- Write my own test. There are a fair few in `test_parser.py`. +- Try it on a couple of other similar pages. I saved a few real ones, see below. +- They suggest Ruby but say use whatever. I went with Python as that's what I'm + most comfortable in. + +The output matches their `files/expected-array.json` exactly, base64 images and +all. + +## How it works + +A couple of things about the HTML aren't obvious until you actually look at it. 
+ +### Finding the items + +Each item in the carousel is the same little block repeated: + +```html +
+``` + +The class names (`iELo6`, `pgNMRc` and so on) are just random-looking hashes and +Google changes them between pages, so I didn't want to rely on those. What seems +stable is the shape: a link pointing at `/search`, with a `stick=` bit in it, +that has an image inside it. On the sample page that gets you exactly the 47 items +and leaves out the "More"/"See more" links (they have `stick=` too but no image, +so there are actually 56 of those links and only 47 real items). + +One thing that caught me out: I first checked `href.startswith("/search")`, but +when you save a page the browser rewrites the links to the full +`https://www.google.com/search?...`, so that missed everything. I switched to +checking that the URL path is `/search`, and now both forms work. + +If there's a `