Revision: Sat, 02 Jul 2022 10:45:30 GMT

Cookbook - Website Scraper

Note: This documentation page contains outdated information relevant only to RoadRunner 1.x.

We can use the Spiral Queue and the RoadRunner server to implement applications that differ from the classic web setup. In this tutorial, we will implement a simple web-scraper application for CLI usage.

The scraped data will be stored in a runtime folder.

The produced code only demonstrates the capabilities and leaves plenty of room for improvement.

Installing Dependencies

We will base our application on spiral/app-cli - the minimalistic Spiral build without ORM, HTTP, and other extensions.

$ composer create-project spiral/app-cli scraper
$ cd scraper

To implement all the needed features, we will require a set of extensions:

Extension                   Comment
spiral/jobs                 Queue support
spiral/scaffolder           Faster scaffolding (dev only)
spiral/prototype            Faster prototyping (dev only)
paquettg/php-html-parser    Parsing HTML

To install all the needed packages and download the application server:

spiral/framework < 2.7.0

$ composer require spiral/jobs spiral/scaffolder spiral/prototype paquettg/php-html-parser

spiral/framework >= 2.7.0

$ composer require spiral/jobs paquettg/php-html-parser

Then download the application server binary:

$ ./vendor/bin/spiral get

Activate the installed extensions in your App\App:

namespace App;

use Spiral\Bootloader;
use Spiral\Framework\Kernel;
use Spiral\Prototype\Bootloader as Prototype;
use Spiral\Scaffolder\Bootloader\ScaffolderBootloader;

class App extends Kernel
{
    protected const LOAD = [
        // ...

        // queue support
        Bootloader\Jobs\JobsBootloader::class,

        // faster prototyping (dev only)
        Prototype\PrototypeBootloader::class,

        // faster scaffolding (dev only)
        ScaffolderBootloader::class,
    ];
}

Make sure to run php app.php configure to ensure proper installation.

Configure App Server

Let's configure the application server with one default queue in memory. Create a .rr.yaml file in the root of the project:

jobs:
  # dispatch all the App\Job\* jobs into the local pipeline
  dispatch:
    app-job-*.pipeline: "local"

  # associate the local pipeline with the `ephemeral` (in-memory) broker
  pipelines:
    local.broker: "ephemeral"

  # consume the local pipeline
  consume: ["local"]

  # worker endpoint and pool size
  workers:
    command: "php app.php"

    # increase the number of workers for IO-bound applications
    pool.numWorkers: 16

Create Job Handler

Now, let's write a simple job handler which will scan the website, get the HTML content, and follow links until the specified depth is reached. All the content will be stored in the runtime directory.

Create a job handler via php app.php create:job scrape. For simplicity, we are not going to use cURL.

namespace App\Job;

use PHPHtmlParser\Dom;
use Spiral\Jobs\JobHandler;
use Spiral\Prototype\Traits\PrototypeTrait;

class ScrapeJob extends JobHandler
{
    use PrototypeTrait;

    public function invoke(int $depth, string $url): void
    {
        if ($this->files->exists(directory('runtime') . md5($url) . '.html')) {
            // already scraped
            return;
        }

        $body = file_get_contents($url);
        $this->store($url, $body);

        $dom = new Dom();
        $dom->load($body);

        foreach ($dom->find('a') as $a) {
            $next = $this->nextURL($url, $a->href);

            if ($next !== null && $depth > 1) {
                $this->queue->push(self::class, ['depth' => $depth - 1, 'url' => $next]);
            }
        }
    }

    private function store(string $url, string $body): void
    {
        $this->files->write(directory('runtime') . md5($url) . '.html', $body);

        $this->files->append(
            directory('runtime') . 'scrape.log',
            sprintf("%s,%s,%s\n", date('c'), md5($url), $url)
        );
    }

    private function nextURL(string $base, ?string $target): ?string
    {
        if ($target === null) {
            return null;
        }

        $base_url = parse_url($base);
        $target_url = parse_url($target);

        if (isset($target_url['scheme'], $target_url['host'])) {
            if ($target_url['host'] !== $base_url['host']) {
                // only same domain
                return null;
            }

            // full path
            return $target;
        }

        if (!isset($target_url['path'])) {
            return null;
        }

        // relative path
        return sprintf("%s://%s%s", $base_url['scheme'], $base_url['host'], $target_url['path']);
    }
}
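
The store() method appends one date,md5,url line per page to scrape.log, so the log doubles as a simple progress report. A quick way to inspect it from the terminal is sketched below; the file here is a sample with placeholder hashes, and the real log lives under whatever directory('runtime') resolves to in your build.

```shell
# Sample of the log format written by store(): one "date,md5,url" line per page.
# The hash values are shortened placeholders.
printf '%s\n' \
  '2022-07-02T10:45:30+00:00,hash1,https://example.com/' \
  '2022-07-02T10:45:31+00:00,hash2,https://example.com/about' \
  '2022-07-02T10:45:30+00:00,hash1,https://example.com/' \
  > scrape.log

# Field 3 is the URL: count the distinct pages stored so far.
distinct=$(cut -d',' -f3 scrape.log | sort -u | wc -l)
echo "distinct URLs: $distinct"
```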

Create Command

Create a command to start the scraping via php app.php create:command scrape:

namespace App\Command;

use App\Job\ScrapeJob;
use Spiral\Console\Command;
use Spiral\Jobs\QueueInterface;
use Symfony\Component\Console\Input\InputArgument;

class ScrapeCommand extends Command
{
    public const NAME        = 'scrape';
    public const DESCRIPTION = 'Scrape the page';
    public const ARGUMENTS   = [
        ['url', InputArgument::REQUIRED, 'URL to scrape'],
        ['depth', InputArgument::OPTIONAL, 'Depth of scanning', 10],
    ];

    protected function perform(QueueInterface $queue): void
    {
        $queue->push(ScrapeJob::class, [
            'url'   => $this->argument('url'),
            'depth' => (int)$this->argument('depth')
        ]);
    }
}

Test it

Launch application server first:

$ ./spiral serve

Scrape any URL via the console command (keep the server running):

$ php app.php scrape https://example.com 5

To observe how many pages have been scraped, use the interactive console:

$ ./spiral jobs:stat -i

The demo solution may scan some pages multiple times; use a proper database or locking mechanism to avoid that.
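
One lightweight option, assuming all workers share a single filesystem, is to take an exclusive non-blocking flock() on a per-URL marker file before fetching. This is only a sketch: tryLockUrl() and the paths are illustrative names, not part of the framework, and a shared database or cache would work equally well.

```php
// Hypothetical helper (not part of Spiral): serialize scraping of a URL
// across workers with an exclusive, non-blocking lock on a marker file.
function tryLockUrl(string $dir, string $url)
{
    // 'c' mode: create the file if missing, never truncate existing content
    $handle = fopen($dir . '/' . md5($url) . '.lock', 'c');

    if ($handle === false || !flock($handle, LOCK_EX | LOCK_NB)) {
        // another worker already holds the lock for this URL
        return null;
    }

    return $handle; // keep the handle open for the duration of the job
}

$dir = sys_get_temp_dir();
$first  = tryLockUrl($dir, 'https://example.com/');
$second = tryLockUrl($dir, 'https://example.com/'); // lock is already held

var_dump($first !== null, $second === null);
```

The lock is released automatically when the handle is closed or the worker exits, so a crashed job does not leave the URL permanently blocked.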
