Notes on mutual information

These are notes I am taking as part of studying recommendation.

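Mutual information looks like a theory that can quantify how strongly
the appearance of word A in a document X implies that word B appears in it too.
I am exploring whether it can serve as a metric for clustering.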

The formula is as follows.

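I was terrible at high-school math (Math A and Math B),
so I am not sure my reading is right,
but I think the formula can be interpreted like this.
(Please note that this is not rigorous notation.)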



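mutual information = log(

  (total number of words appearing across all documents * co-occurrence frequency of word A and word B) / (frequency of word A * frequency of word B)

)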
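
As a cross-check, here is how that reading lines up with pointwise mutual information (PMI), assuming PMI is the quantity the formula is after. The symbols N (the number of units being counted, for example documents), c(A), c(B) and c(A,B) are labels introduced only for this sketch. Estimating each probability as a count divided by N gives the same shape as the expression above, provided N is the number of units the counts are taken over.

```latex
\begin{align*}
\mathrm{PMI}(A,B) &= \log \frac{P(A,B)}{P(A)\,P(B)} \\
  &= \log \frac{c(A,B)/N}{\bigl(c(A)/N\bigr)\,\bigl(c(B)/N\bigr)} \\
  &= \log \frac{N \cdot c(A,B)}{c(A) \cdot c(B)}
\end{align*}
% Made-up numbers: N = 1000, c(A) = 50, c(B) = 20, c(A,B) = 10
% PMI = \log\frac{1000 \cdot 10}{50 \cdot 20} = \log 10 \approx 2.30 \quad (natural log)
% With c(A,B) = 1, the count expected under independence, PMI = \log 1 = 0.
```

So a positive value means the two words share documents more often than chance would predict, and 0 means they look independent.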

I have probably got a number of things wrong, though.
Anyway, here is the SQL I wrote based on it.

insert into term_matual_infos (term_id, tar_term_id, score)
select
  t1.id,
  t2.id,
  log(
    -- numerator
    (
      -- total number of terms (rows in terms)
      (select count(id) from terms)
      *
      -- co-occurrence frequency of A and B:
      -- share of documents that contain both t1 and t2
      (
        select
          -- nullif keeps this NULL rather than 0 when the pair never co-occurs
          nullif(count(a.term_id), 0) / (select count(id) from documents)
        from document_terms a
        where
          a.document_id in (
            select document_id from document_terms where term_id = t1.id
          )
          and a.term_id = t2.id
      )
    )
    /
    -- denominator
    (
      -- frequency of A: share of documents that contain t1
      (
        select
          (select count(term_id) from document_terms where term_id = t1.id)
          /
          (select count(id) from documents)
      )
      *
      -- frequency of B: share of documents that contain t2
      (
        select
          (select count(term_id) from document_terms where term_id = t2.id)
          /
          (select count(id) from documents)
      )
    )
  ) as mutual_information
from
  terms t1
join
  terms t2
  on t2.id != t1.id
where
  t1.idf > 2
  and
  t2.idf > 2
-- score is NOT NULL, so skip pairs that never co-occur (their log is NULL)
having
  mutual_information is not null;
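
The correlated subqueries above are re-evaluated for every pair of terms, which can get slow. Below is a rough sketch of the same idea written as a self-join, assuming MySQL and the tables defined further down. It is not the query used for the results in this post: it uses the number of documents as N instead of the number of rows in terms, which shifts every score by a constant, and it only returns pairs that co-occur at least once. The aliases df_a, df_b and n are names made up for this example.

```sql
-- Sketch only: document-level PMI for every co-occurring pair via a self-join.
-- df_a, df_b and n are hypothetical aliases introduced for this example.
select
  a.term_id as term_id,
  b.term_id as tar_term_id,
  -- log( N * c(A,B) / (c(A) * c(B)) ), with every count taken in documents
  log(n.n_docs * count(*) / (df_a.df * df_b.df)) as score
from document_terms a
join document_terms b
  on b.document_id = a.document_id   -- same document ...
 and b.term_id <> a.term_id          -- ... different term
join terms t1 on t1.id = a.term_id and t1.idf > 2
join terms t2 on t2.id = b.term_id and t2.idf > 2
-- document frequency of each term
join (select term_id, count(*) as df from document_terms group by term_id) df_a
  on df_a.term_id = a.term_id
join (select term_id, count(*) as df from document_terms group by term_id) df_b
  on df_b.term_id = b.term_id
-- total number of documents
cross join (select count(*) as n_docs from documents) n
group by a.term_id, b.term_id, df_a.df, df_b.df, n.n_docs;
```

Because (document_id, term_id) is the primary key of document_terms, each joined row stands for one document containing both terms, so count(*) per pair is the number of shared documents.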

The table definitions are as follows.

-- Documents table
CREATE TABLE `documents` (
  `id` int(11) NOT NULL auto_increment,
  `title` text,
  `document` longtext,
  `n` float default NULL COMMENT 'for document normalization (cosine normalization)',
  PRIMARY KEY  (`id`)
);


-- Table storing the terms that appear in each document
CREATE TABLE `document_terms` (
  `document_id` int(11) NOT NULL,
  `term_id` int(11) NOT NULL,
  `frequency` int(11) default NULL COMMENT 'occurrence count',
  `tf` float default NULL,
  `score` float default NULL COMMENT 'TF-IDF score',
  PRIMARY KEY  (`document_id`,`term_id`)
);


-- Terms table
CREATE TABLE `terms` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(50) NOT NULL,
  `idf` float default NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `name` (`name`)
);


-- Table storing the information score between pairs of terms
create table term_matual_infos(
  term_id int not null,
  tar_term_id int not null,
  score float not null
);

The results look like this.

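The target word is 「海女」 (ama, a female diver).
The first block is the word list extracted from co-occurrence information,
and the second block is the word list extracted using mutual information.
「就職」 (getting a job) shows up for 「海女」,
apparently because of a recent news story about a beautiful woman becoming an ama (that is, taking a job as one).
(Reference URL: http://www.youtube.com/watch?v=nlbSpQqCV9E)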
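
A list like the second one can be pulled from term_matual_infos with a query along these lines. This is a sketch rather than the exact query behind the lists above: the limit of 20 is an arbitrary choice, and src and rel are aliases made up for this example.

```sql
-- Sketch: terms with the highest stored score for one target word.
-- The limit of 20 and the aliases src / rel are illustrative choices.
select
  rel.name as related_term,
  m.score
from term_matual_infos m
join terms src on src.id = m.term_id       -- the target word
join terms rel on rel.id = m.tar_term_id   -- the word it is paired with
where src.name = '海女'
order by m.score desc
limit 20;
```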

Note 1:
This list might be useful for noticing things,
but I do not think it can be used to cluster words as similar terms.

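Note 2:
I would like to try again later with various other definitions of the information measure.
I wonder whether things would change if the frequencies were scaled to the range 0 to 1.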
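
On the 0 to 1 question, it may help to write out what the SQL above computes, as far as I can tell from reading it. Here D is the number of rows in documents, V is the number of rows in terms, and d(.) is the number of documents containing the given term or pair. These symbols are introduced only for this note.

```latex
\begin{align*}
\mathrm{score}(A,B)
  &= \log \frac{V \cdot \dfrac{d(A,B)}{D}}{\dfrac{d(A)}{D} \cdot \dfrac{d(B)}{D}} \\
  &= \log V + \log \frac{D \cdot d(A,B)}{d(A) \cdot d(B)}
\end{align*}
```

Read this way, the three frequencies in the query are already document fractions between 0 and 1. The factor V only adds the same constant log V to every score, so it should not change the ranking of related words for a given term.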