Readability: Strip the Clutter from Web Pages and Extract Just the Main Text (a Real Time-Saver)

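Introduction

When you scrape web pages, the most tedious part is deciding what to do with all the junk that isn't the main text.

Ads, sidebars, navigation, related articles... Honestly, all I want is the body, but the full HTML is packed with everything else. Stripping that out by hand with CSS selectors isn't realistic, because the selectors differ from site to site.

Readability is the library that solves this problem in one shot.

It's actually a standalone version of the same library that powers Firefox's Reader View. You know, the button that makes a page clean and easy to read. You can use that exact technology in your own code.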

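What Is Readability?

Readability is a JavaScript library, developed and maintained by Mozilla, for extracting the main content from web pages.

It has more than 8,600 stars on GitHub and a long track record as the core of Firefox's Reader View.

Personally, the fact that Firefox itself actually uses it is what makes me trust it. This is technology running in a browser that millions of people use every day.

Official repository: https://github.com/mozilla/readability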

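Features and Benefits

1. Accurate main-content extraction

Rather than machine learning, it detects the main text with a heuristic (rule-of-thumb) algorithm, looking at signals such as:

Text density
HTML structure (elements such as article and main)
Class names and IDs (content, article, main, and so on)
Link density (ads tend to be link-heavy)

It weighs these together and estimates which part of the page is probably the article body.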
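
To make the idea concrete, here is a toy sketch of how signals like these could be combined into a single score. To be clear, this is not Readability's actual algorithm: the function names (linkDensity, scoreCandidate, pickBestCandidate), the weights, and the regular expressions are all assumptions I made up for illustration.

```javascript
// Toy illustration only: NOT the scoring Readability actually uses.
// Each candidate is a plain object describing one block of the page.
const POSITIVE_HINT = /article|content|main|post|body/i;
const NEGATIVE_HINT = /sidebar|comment|footer|nav|promo|related/i;

// Fraction of a block's text that sits inside links (0 to 1).
function linkDensity(candidate) {
  if (candidate.textLength === 0) return 0;
  return candidate.linkTextLength / candidate.textLength;
}

// Combine text length, class/id hints, and link density into one number.
function scoreCandidate(candidate) {
  let score = Math.min(candidate.textLength / 100, 10); // more text scores higher, capped at 10
  const hint = `${candidate.className} ${candidate.id}`;
  if (POSITIVE_HINT.test(hint)) score += 5;
  if (NEGATIVE_HINT.test(hint)) score -= 5;
  return score * (1 - linkDensity(candidate)); // link-heavy blocks are scaled down
}

// Pick the highest-scoring block as the presumed article body.
function pickBestCandidate(candidates) {
  return candidates.reduce((best, c) =>
    (scoreCandidate(c) > scoreCandidate(best) ? c : best));
}

const candidates = [
  { id: 'sidebar', className: 'related-links', textLength: 400, linkTextLength: 360 },
  { id: 'main', className: 'article-body', textLength: 3000, linkTextLength: 90 },
];
console.log(pickBestCandidate(candidates).id); // "main"
```

The real library works on DOM nodes and uses far more rules than this, but the overall shape (score the candidates, penalize link-heavy blocks, keep the winner) is the intuition to take away.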

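Now that I'm in my thirties, I've come to appreciate that heuristics are not to be underestimated. They aren't perfect, but if they work properly on the vast majority of sites, that's good enough in practice.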

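2. Rich extraction results

Calling the parse() method gives you all of the following in one go.

title: article title
content: main content as HTML
textContent: plain-text version of the main content
excerpt: excerpt (something like a meta description)
byline: author information
siteName: site name
lang: language

This is handier than it sounds. Thanks to textContent in particular, you don't need a separate step to strip HTML tags before feeding the text to an LLM.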
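
As a rough idea of what that looks like in practice, here is a small sketch that turns a parse() result into a compact block of text for a prompt. The helper name buildPromptInput and the maxChars budget are my own assumptions, and the sample object is a hand-written stand-in shaped like a parse() result so the snippet runs without a DOM.

```javascript
// Hypothetical helper: turn a parse() result into compact text for an LLM.
// `article` is assumed to have title, byline, and textContent; byline may be null.
function buildPromptInput(article, maxChars = 8000) {
  // Collapse the runs of whitespace that page layout tends to leave behind.
  const body = article.textContent.replace(/\s+/g, ' ').trim();
  const header = [article.title, article.byline].filter(Boolean).join(' / ');
  // Truncating by characters is only a rough stand-in for a token budget.
  return `${header}\n\n${body.slice(0, maxChars)}`;
}

// Stand-in object shaped like a parse() result (not real extracted data).
const sample = {
  title: 'Example title',
  byline: null,
  textContent: '  First paragraph.\n\n\n   Second   paragraph.  ',
};
console.log(buildPromptInput(sample));
```

No tag stripping is needed anywhere in there, which is the whole point of textContent.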

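3. Designed with security in mind

As the official documentation also stresses, when the input is untrusted you should use Readability together with a sanitizer such as DOMPurify.

If you render the extracted HTML as-is, XSS protection is a must. I like that the project spells this out up front.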

Installation

You can install it from npm.

npm install @mozilla/readability

In the browser it works as-is, but in Node.js you also need jsdom. Readability takes a DOM document object as its argument, so in Node.js you build a DOM with jsdom first and pass that in.

npm install @mozilla/readability jsdom

Basic Usage

In the browser

In a browser environment, the page's own document object is all you need.

import { Readability } from '@mozilla/readability';

// Extract the main content from the current page
const article = new Readability(document.cloneNode(true)).parse();

console.log(article.title);       // article title
console.log(article.content);     // main content as HTML
console.log(article.textContent); // plain text

Important: pass a clone made with document.cloneNode(true). Readability destructively modifies the DOM it is given, so cloning keeps the live page unaffected.

In Node.js

In Node.js, you use it together with jsdom.

import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function extractArticle(url) {
  // Fetch the HTML (global fetch is available in Node.js 18+)
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }
  const html = await response.text();

  // Build a DOM with jsdom
  const dom = new JSDOM(html, { url });

  // Extract the main content; parse() returns null when nothing is found
  const reader = new Readability(dom.window.document);
  return reader.parse();
}

// Example usage
const article = await extractArticle('https://example.com/article');
if (article) {
  console.log(article.title);
  console.log(article.textContent);
}
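
One detail in the example above is easy to skim past: the { url } option passed to JSDOM. As I understand it, this gives the document a base URL so that relative links and image paths in the extracted content can be turned into absolute ones. The resolution rule itself is standard URL resolution, which you can try in isolation with the built-in URL class. The paths below are made up for illustration.

```javascript
// Standard URL resolution against a base, using only the built-in URL class.
// The base URL and the relative paths are hypothetical examples.
const base = 'https://example.com/blog/post.html';

console.log(new URL('images/photo.png', base).href); // https://example.com/blog/images/photo.png
console.log(new URL('/about', base).href);           // https://example.com/about
```

Without a base URL there is nothing to resolve against, so it is worth always passing the page URL when you build the DOM.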

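Constructor Options

The Readability constructor accepts several options. The values shown here are the defaults.

const reader = new Readability(document, {
  debug: false,           // whether to print debug logs
  charThreshold: 500,     // minimum number of characters required for an article
  keepClasses: false,     // whether to keep class attributes
  disableJSONLD: false,   // whether to skip JSON-LD parsing
});

charThreshold is quietly useful: it lets you have content that is too short judged as "this isn't an article".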

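Pre-check: isProbablyReaderable

You can check up front whether a page is likely to be readable in a reader view at all.

import { Readability, isProbablyReaderable } from '@mozilla/readability';

if (isProbablyReaderable(document)) {
  // Parse a clone so the live page is left untouched
  const article = new Readability(document.cloneNode(true)).parse();
  // Show the content
} else {
  // Treat the page as not suitable for a reader view
}

This is useful for deciding whether to show the user a Reader View button.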
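
If the default check feels too strict or too lenient for your pages, isProbablyReaderable also takes an options object as its second argument. As far as I know it supports minContentLength and minScore (defaults 140 and 20), but check the README for the current list. The sketch below wraps the check-then-parse flow in one function. The name extractIfReaderable and the idea of passing the library in as a parameter are my own, chosen so the snippet can be exercised with stubs instead of a real DOM.

```javascript
// Hypothetical wrapper: run the cheap check first, then parse a clone.
// `lib` is meant to be { Readability, isProbablyReaderable } from '@mozilla/readability'.
function extractIfReaderable(doc, lib, checkOptions = {}) {
  // checkOptions is forwarded as-is, e.g. { minContentLength: 140, minScore: 20 }
  if (!lib.isProbablyReaderable(doc, checkOptions)) {
    return null; // not article-like: skip the full parse
  }
  // parse() modifies the document it receives, so give it a copy.
  return new lib.Readability(doc.cloneNode(true)).parse();
}

// Stubs standing in for the real library and DOM, only to show the control flow.
const stubLib = {
  isProbablyReaderable: (doc, opts) => doc.textLength >= (opts.minContentLength ?? 140),
  Readability: class {
    constructor(doc) { this.doc = doc; }
    parse() { return { title: this.doc.title }; }
  },
};
const stubDoc = { title: 'Stub page', textLength: 200, cloneNode() { return { ...this }; } };

console.log(extractIfReaderable(stubDoc, stubLib));                            // { title: 'Stub page' }
console.log(extractIfReaderable(stubDoc, stubLib, { minContentLength: 500 })); // null
```

In real code you would pass the actual imports, for example extractIfReaderable(document, { Readability, isProbablyReaderable }).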

Practical Use Cases

1. Article-saving app (a Pocket clone)

import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';
import DOMPurify from 'dompurify';

async function saveArticle(url) {
  const response = await fetch(url);
  const html = await response.text();

  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();

  if (!article) {
    throw new Error('Failed to extract the article');
  }

  // Sanitize as a defense against XSS
  const purify = DOMPurify(dom.window);
  const cleanContent = purify.sanitize(article.content);

  // Save to the database (db stands for your own data layer; it is not defined here)
  await db.articles.create({
    url,
    title: article.title,
    content: cleanContent,
    excerpt: article.excerpt,
    savedAt: new Date(),
  });

  return article;
}

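2. Fetching full text in an RSS reader

When an RSS feed only carries an excerpt, you can fetch the full text from the original page.

import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function fetchFullContent(feedItem) {
  const dom = new JSDOM(await fetch(feedItem.link).then(r => r.text()), {
    url: feedItem.link,
  });

  const article = new Readability(dom.window.document).parse();

  return {
    ...feedItem,
    // Fall back to the feed's own content when extraction fails
    fullContent: article?.textContent || feedItem.content,
  };
}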

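3. Preprocessing content for an LLM

Before asking an LLM to summarize a web page, strip out the noise.

import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

async function prepareForLLM(url) {
  const dom = new JSDOM(await fetch(url).then(r => r.text()), { url });
  const article = new Readability(dom.window.document).parse();

  if (!article) {
    throw new Error('Could not extract any content');
  }

  // Return plain text (no HTML tags needed)
  return {
    title: article.title,
    text: article.textContent,
    // Whitespace-based count; trim and filter so empty strings are not counted
    wordCount: article.textContent.trim().split(/\s+/).filter(Boolean).length,
  };
}

Cost-wise, you really don't want to feed an LLM tokens it doesn't need.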
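
One caveat about the wordCount above: splitting on whitespace only gives a meaningful number for space-delimited languages. For Japanese text, where words are not separated by spaces, a whole sentence can come out as a single "word". One possible alternative is the built-in Intl.Segmenter, sketched below. The function name countWords and the default locale are my own choices for illustration, and the exact segmentation depends on the ICU data bundled with your runtime.

```javascript
// Count word-like segments with the built-in Intl.Segmenter (Node.js 16+).
// The default locale 'ja' is just an example; pass article.lang if you have it.
function countWords(text, locale = 'ja') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) count += 1; // skip punctuation and whitespace
  }
  return count;
}

console.log(countWords('今日は良い天気です。'));          // more than 1, unlike a whitespace split
console.log(countWords('Readability is handy.', 'en')); // 3
```

Even this is only a proxy for token count, but it is a more honest size estimate across languages than a whitespace split.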

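4. Reader mode in a browser extension

// content.js
// Assumes Readability.js is loaded before this script (for example, listed first in the manifest's content_scripts)
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'extractArticle') {
    // Parse a clone so the page the user is viewing is not modified
    const article = new Readability(document.cloneNode(true)).parse();
    sendResponse(article);
  }
});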