Notes on mutual information, part 2

Writing up the test results one after another looked like it would get long, so I split this into a separate post.

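This time I tried using idf values as the information content.
The numerator already captures how likely term B is to appear when term A appears,
so for the denominator I use idf, a score that weights terms in the opposite direction.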



mutual information = log(

  (total number of terms across all documents * co-occurrence count of term A and term B)

   /

  (idf of term A * idf of term B)

)
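
As a quick sanity check of the formula above, here is a minimal Python sketch. The function name `idf_weighted_score` and the toy numbers are my own assumptions for illustration, not values from the actual data. It uses the natural logarithm and returns None when the logarithm is undefined, for example when the two terms never co-occur.

```python
import math


def idf_weighted_score(total_terms, cooccurrence, idf_a, idf_b):
    """Return log((total_terms * cooccurrence) / (idf_a * idf_b)).

    Hypothetical helper for illustration. Returns None when the value
    inside the logarithm would not be positive (e.g. zero co-occurrence).
    """
    numerator = total_terms * cooccurrence
    denominator = idf_a * idf_b
    if numerator <= 0 or denominator <= 0:
        return None
    return math.log(numerator / denominator)


# Toy values (assumed): 1000 terms, 5 co-occurrences, idf 2.5 and 4.0.
score = idf_weighted_score(1000, 5, 2.5, 4.0)
print(score)

# A pair with the same co-occurrence count but higher idf values scores lower.
print(idf_weighted_score(1000, 5, 5.0, 8.0))
```

Because idf sits in the denominator, a pair of rarer terms (higher idf) gets a lower score than a pair of common terms with the same co-occurrence count.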

The SQL I ran is as follows.

select
  t1.id,
  t1.name,
  t2.id,
  t2.name,
  log(
    -- T1
    (
      -- total number of terms (row count of the terms table)
      (select count(id) from terms as t_cnt)
      *
      -- co-occurrence count of A and B
      (
        select
          count(term_id) as co_cnt
        from document_terms a
        join terms b
          on a.term_id = b.id
        where
          document_id in (
            select document_id from document_terms where term_id = t1.id
          ) 
          and term_id != t1.id
          and term_id = t2.id
        group by term_id
        order by co_cnt desc
        limit 1
      )
    )
    /
    -- T2
    (
      -- information content of A (idf)
      t1.idf
      *
      -- information content of B (idf)
      t2.idf
    )
  ) as mutual_information
from
  terms t1
join
  terms t2
  on t2.id != t1.id
where
  t1.id = 12198
  and
  t1.idf > 2
  and
  t2.idf > 2
-- group by
--  t1.id
having
  mutual_information is not null
order by
  mutual_information desc
limit
  30
;
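
To see what the co-occurrence subquery in the query above counts, here is a small sketch using Python's built-in sqlite3 module with an in-memory table. The table name `document_terms` and its columns come from the query. The toy rows, the helper name `cooccurrence_count`, and the assumption that the table holds one row per (document, term) pair are mine. The redundant join, group by, and limit from the original subquery are dropped here because term B is already fixed.

```python
import sqlite3

# In-memory database with toy data (assumed, not the real data set).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table document_terms (document_id integer, term_id integer);
    -- document 1 has terms {1, 2}, document 2 has {1, 2, 3}, document 3 has {2, 3}
    insert into document_terms values
      (1, 1), (1, 2),
      (2, 1), (2, 2), (2, 3),
      (3, 2), (3, 3);
    """
)


def cooccurrence_count(conn, term_a, term_b):
    """Count rows for term_b in documents that also contain term_a."""
    row = conn.execute(
        """
        select count(*)
        from document_terms
        where term_id = ?
          and document_id in (
            select document_id from document_terms where term_id = ?
          )
        """,
        (term_b, term_a),
    ).fetchone()
    return row[0]


print(cooccurrence_count(conn, 1, 2))
print(cooccurrence_count(conn, 1, 3))
```

With one row per (document, term) pair, this count is the number of documents that contain both terms.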