"The Art of Web Scraping: Tree Algorithms and JS Magic", Dmytro Tarasenko

fwdays, 30 slides, Oct 18, 2025

About This Presentation

In this talk, Dmytro will share how he created a DSL (domain-specific language) for parsing HTML pages, designed specifically for web scraping. We’ll dive into tree algorithms, JavaScript code optimization, the DOM, and try it all out with some live coding.


Slide Content

problem :: solution :: harvester :: tpl :: parser :: match :: algo :: score :: harvest :: perf :: cache :: ai :: demo :: summary :: github :: questions

problem
How to get product name, price and an image from HTML?


problem
How to get product name, price and an image from HTML?
Imagine you have HTML like this … or like this

problem
How to get product name, price and an image from HTML?
CSS query, XPath, RegExp, AI-based tools, visual editors

problem
Probably your code…


solution
or DSL: use a simple DSL to describe the data we want to grab

harvester lib
[Diagram: HTML; tree-like template; harvester; template parser; fuzzy tree match; data; JSON; JSON tree as a string; JSON Object]

harvester:template
spaces: 2 spaces = 1 level
tags: div, *
types: int, float, with, func, str, empty,…
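The "2 spaces = 1 level" rule can be illustrated with a small indentation parser. This is only a sketch under stated assumptions: the function name `parseTemplate`, the node shape `{ tag, level, children }`, and the idea that each line holds nothing but a tag name are hypothetical, and harvester's real template syntax (types such as int or float, attributes, and so on) is not modelled here.

```javascript
// Hypothetical sketch: turn an indentation-based template into a tree.
// Assumption: every line is just a tag name (or *), indented by 2 spaces per level.
function parseTemplate(text) {
  const root = { tag: 'root', level: -1, children: [] }
  const stack = [root] // path from the root to the most recently added node

  for (const line of text.split('\n')) {
    if (!line.trim()) continue // skip empty lines
    const indent = line.length - line.trimStart().length
    if (indent % 2 !== 0) throw new Error(`Odd indentation in line: "${line}"`)
    const level = indent / 2

    // Climb back up until the top of the stack is the parent of this line
    while (stack[stack.length - 1].level >= level) stack.pop()
    const parent = stack[stack.length - 1]
    if (level > parent.level + 1) throw new Error(`Skipped a level in line: "${line}"`)

    const node = { tag: line.trim(), level, children: [] }
    parent.children.push(node)
    stack.push(node)
  }
  return root.children
}

const tree = parseTemplate('div\n  h1\n  *\n    span')
console.log(JSON.stringify(tree, null, 2))
```

The stack keeps the current ancestor path, so each line is attached to the nearest preceding line with a smaller indentation level.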

harvester:parser
template parser

harvester:fuzzy tree match
PROBLEMS: recursion hell, combinations, deep DOM, search time, similar tags, exponential complexity, partial equality, hard to debug

harvester:match algo
The template (tpl) with nodes k, l, m is searched in a DOM with nodes 1-9 and a-g.
Candidate DOM node groups per template fragment:
k, l, m: 1 | 2, 3 | 4, 5, 6, 7, 8 | 9, a, b, c, d, e, f, g
kl, km: 12, 13 | 14, 15, 16, 17, 18 | 19, 1a, 1b, 1c, 1d,… | 24, 25, 36, 37, 38 | 29, 2a, 2b, 2c, 3d, 3e,… | 49, 4a, 5b, 5c, 7d, 7e,…
lm: 23 | 45, 67, 78, 68 | 9a, bc, de, fg
klm: 123, 245, 367, 378, 368,… | 145, 167, 178, 168 | 19a, 1bc, 1de, 1fg | 245, 367, 378, 368 | 29a, 2bc, 3de, 3fg | 49a, 5bc, 7de, 8fg

harvester:match algo
1, 2, 3: recursion, recursion, recursion
FOR EVERY NODE!
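To make the "for every node" recursion concrete, here is a minimal sketch that enumerates only the simplest candidates for a template k with children l and m: a DOM node plus an ordered pair of its direct children (123, 245, 367, … on the slide). The tree shape is my reading of the pairs listed on the slide, and the names `leaf`, `dom` and `directCandidates` are hypothetical. The fuzzy matches that skip levels (145, 19a, …) are not covered.

```javascript
// Hypothetical helper: a DOM-like node without children
function leaf(id) {
  return { id, children: [] }
}

// DOM tree as inferred from the parent/child pairs on the slide (an assumption)
const dom = {
  id: '1',
  children: [
    { id: '2', children: [
      { id: '4', children: [leaf('9'), leaf('a')] },
      { id: '5', children: [leaf('b'), leaf('c')] }
    ] },
    { id: '3', children: [
      leaf('6'),
      { id: '7', children: [leaf('d'), leaf('e')] },
      { id: '8', children: [leaf('f'), leaf('g')] }
    ] }
  ]
}

// For every node, list "node + ordered pair of its direct children",
// then recurse into each child and do the same.
function directCandidates(node, out = []) {
  const kids = node.children
  for (let i = 0; i < kids.length; i++) {
    for (let j = i + 1; j < kids.length; j++) {
      out.push(node.id + kids[i].id + kids[j].id)
    }
  }
  for (const kid of kids) directCandidates(kid, out)
  return out
}

console.log(directCandidates(dom).join(', '))
// 123, 245, 49a, 5bc, 367, 368, 378, 7de, 8fg
```

Even this restricted enumeration grows with the number of child pairs at every node, which hints at why the unrestricted fuzzy search becomes expensive.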

harvester:score
Score is an indicator of how close the branches of the target tree are to each other. Our goal is to maximize the score.
nodeScore ===
if tag: +1
if text & textTag: +6
if text & textType: +12
if text & !textType: -12
if attr & textAttr: +6
maxScore = nodeScore + inScore + maxLevel - level
Example: branch scores 1+12+7, 1+12+7, 1+6+7, 1+12+7 (74 in total); root 1+74+8; maxScore: 83 points
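The 83 points on the slide can be reproduced with a short recursive sketch of the formula maxScore = nodeScore + inScore + maxLevel - level. The weights come from the slide; everything else is an assumption made for illustration: the flag names (`tag`, `textFromTag`, `typedText`, `textFromAttr`), the reading of inScore as the sum of the children's scores, and maxLevel = 8 with the root at level 0.

```javascript
// Weights are taken from the slide; flag names are hypothetical.
function nodeScore(match) {
  let score = 0
  if (match.tag) score += 1                        // tag matches
  if (match.textFromTag) score += 6                // text found in the tag
  if (match.typedText === 'ok') score += 12        // text matches the expected type
  if (match.typedText === 'mismatch') score -= 12  // text does not match the type
  if (match.textFromAttr) score += 6               // text found in an attribute
  return score
}

// Assumption: inScore is the sum of the scores of all child branches.
function maxScore(node, level, maxLevel) {
  const inScore = node.children.reduce(
    (sum, child) => sum + maxScore(child, level + 1, maxLevel), 0)
  return nodeScore(node.match) + inScore + maxLevel - level
}

const typed = () => ({ match: { tag: true, typedText: 'ok' }, children: [] })
const untyped = () => ({ match: { tag: true, textFromTag: true }, children: [] })

const root = {
  match: { tag: true },
  children: [typed(), typed(), untyped(), typed()]
}

console.log(maxScore(root, 0, 8)) // 83
```

Each typed leaf contributes 1 + 12 + 7 = 20, the untyped leaf 1 + 6 + 7 = 14, and the root adds 1 + 8 on top of the 74 collected from its branches.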

harvester:harvest()
[Figure labels: maxScore, found score]

harvester:perf
> ~130 000x speed increase after refactoring:
go outside the DOM root
go not deeper than tree size
copy() function optimisations
cache score & compare it with maxScore
subsets() function to return nodes from max len to min
compare current subset’s score and maxScore
cache: tagName, textContent, parentNode, firstElementChild, nextElementSibling
level = round(level * 1.618033) for every deeper/upper node
many small optimizations…
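Two of the items above, returning subsets from the longest to the shortest and comparing a subset's score with the best score so far, combine naturally into an early exit. The sketch below is not harvester's actual subsets() implementation; `combinations`, `bestSubset`, `scoreOf` and `maxItemScore` are hypothetical names, and the pruning rule assumes that no single item can score more than `maxItemScore`.

```javascript
// Yield all order-preserving combinations of a given size
function* combinations(items, size, start = 0, picked = []) {
  if (picked.length === size) {
    yield picked.slice()
    return
  }
  // Stop early when too few items remain to fill the combination
  for (let i = start; i <= items.length - (size - picked.length); i++) {
    picked.push(items[i])
    yield* combinations(items, size, i + 1, picked)
    picked.pop()
  }
}

// Yield subsets from the longest to the shortest
function* subsets(items) {
  for (let size = items.length; size >= 1; size--) {
    yield* combinations(items, size)
  }
}

// Stop as soon as even a perfect subset of the current size cannot beat the best score
function bestSubset(items, scoreOf, maxItemScore) {
  let best = { score: -Infinity, subset: [] }
  for (const subset of subsets(items)) {
    if (subset.length * maxItemScore <= best.score) break
    const score = subset.reduce((sum, item) => sum + scoreOf(item), 0)
    if (score > best.score) best = { score, subset }
  }
  return best
}

console.log([...subsets([1, 2, 3])])
// [ [1,2,3], [1,2], [1,3], [2,3], [1], [2], [3] ]
console.log(bestSubset([13, -11, 7], x => x, 13))
// { score: 20, subset: [ 13, 7 ] }
```

Because sizes only decrease, the upper bound `subset.length * maxItemScore` never grows again, so breaking out of the loop is safe once it falls to the best score found.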

harvester:perf:cache
Getting DOM elements and attributes is slow, so we put them in a cache.
The cache is used for: score, tagName, tagContent, parentNode, firstElementChild, nextElementSibling
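A property cache of this kind can be sketched with a WeakMap keyed by node. This is an illustration, not the library's code: `cached` and `fakeNode` are hypothetical names, the fake node stands in for a real DOM element so the snippet runs outside a browser, and the approach assumes the DOM does not change while a harvest is running.

```javascript
// node -> Map(property name -> value); entries vanish when the node is garbage-collected
const cache = new WeakMap()

function cached(node, prop) {
  let props = cache.get(node)
  if (!props) {
    props = new Map()
    cache.set(node, props)
  }
  // Read the (slow) property only on the first request
  if (!props.has(prop)) props.set(prop, node[prop])
  return props.get(prop)
}

// Stand-in for a DOM element that counts how often tagName is actually read
let reads = 0
const fakeNode = {
  get tagName() {
    reads++
    return 'DIV'
  }
}

cached(fakeNode, 'tagName')
cached(fakeNode, 'tagName')
console.log(reads) // 1
```

Using `has()` rather than a truthiness check means that null values, such as the firstElementChild of a leaf node, are cached as well.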

harvester:perf:cache
all children’s score: prevents deep search
parent nodes cache: search lower than parent

harvester:ai
In most cases, the AI was wrong in choosing the algorithm and optimization approaches. When asked to write basic code for fuzzy tree comparison, the AI wrote a simple brute-force approach, which is the slowest one. But AI generated many good ideas…
ChatGPT, Gemini, Copilot, Grok


demo:perf on rozetka.com.ua

demo:puppeteer:news on pravda.com.ua

demo:puppeteer:rozetka on rozetka.com.ua

demo:puppeteer:amazon on amazon.com

API summary
harvest() // works in browser & keeps core logic
harvestPage() // harvests 1 rec for Puppeteer | Playwright
harvestPageAll() // harvests many recs for Puppeteer | Playwright

summary
advantages:
declarative (readability)
find nodes without ids
tiny size, only 791 lines
robustness to DOM changes
getting data in fuzzy DOM
it is pleasurably fast
get all data per one call
supports text data types
works with Puppeteer / Playwright

summary
plans:
add more types
promotion plans
add cheerio support
add playwright support
add puppeteer support
harvestPageAll(), harvestPage()

github
github.com/tmptrash/harvester
npmjs.com/package/js-harvester

questions?