Scalaを使ってみる: (5) プログラム作成 (mutable版) - 屯遁のパズルとプログラミングの日記

Scalaを勉強している．勉強中の身ではあるが，以下を例題として，Scalaプログラムの作り方について説明してみる．

テキストファイル中に現れる英単語の出現回数を数えて，出現回数の多い語から表示する．

入力のテキストファイルとしては，Project Gutenberg 中のHalmet を用いる(ファイル名を hamlet.txt，改行はLFにした)．
なお利用している環境は，Ubuntu 8.04 LTS上の Scala 2.8.0 RC2 (2010年5月10日リリース)， Java 1.6.0である．

リンク

ここまでのまとめ

ここまでのmutable Mapを用いた方法をまとめると以下のようになる．

  def getLines(f: String) =
    io.Source.fromFile(f).getLines().filterNot(_.matches("\\s*"))
  def toWords(s: String) = "[a-zA-Z]+".r.findAllIn(s).map(_.toLowerCase)
  val words = collection.mutable.Map[String,Int]()
  def countUp(w: String) = words += (w -> (words.getOrElse(w, 0)+1))

  words.clear
  getLines("hamlet.txt").flatMap(toWords).foreach(countUp)
  for (w <- words.keys.toList.sortBy(words(_))(Ordering[Int].reverse))
    yield (w,words(w))

2010年11月29日追記: io.Source.fromPath を io.Source.fromFile に修正した．

プログラム作成

以下のようなプログラムにまとめてみた．

import scala.io.Source
import scala.collection._

object WordCount1 {
  def getLines(f: String) =
    Source.fromFile(f).getLines().filterNot(_.matches("\\s*"))
  def toWords(s: String) =
    "[a-zA-Z]+".r.findAllIn(s).map(_.toLowerCase)
  def countWords(iter: Iterator[String]) = {
    val words = mutable.Map[String,Int]()
    def countUp(w: String) =
      words += (w -> (words.getOrElse(w, 0)+1))
    iter.flatMap(toWords).foreach(countUp)
    // iter.foreach(toWords(_).foreach(countUp))
    words
  }
  def sortReverse(map: Map[String,Int]) =
    map.keys.toList.sortBy(map(_))(Ordering[Int].reverse)

  def main(args: Array[String]) {
    val fileName = args(0)
    val words = countWords(getLines(fileName))
    for (w <- sortReverse(words))
      println(w, words(w))
  }
}

2010年11月29日追記: io.Source.fromPath を io.Source.fromFile に修正した．

速度の計測

hamlet.txt だとサイズが小さすぎるので，たとえば "aaaa" から "zzzz" までのすべての語を出力する関数を作成した．

  def f(n: Int, s: String) : Unit =
    if (n == 0) println(s) else for (c <- 'a' to 'z') f(n-1, s+c)

これを (1 to 10).foreach(i => f(4, "")) として実行すると， 26⁴ = 456976 通りの語を10回出力するので，それを words.txt として保存した．
各語が改行を含めて5バイトなので，全体では 10*456876*5 = 22843800 バイト(約23MB) となる．
実行結果は，以下の通りである．

  $ scalac WordCount1.scala
  $ time scala WordCount1 words.txt >/dev/null
  real  0m41.647s
  user  0m45.899s
  sys   0m2.504s

何度か実行したが，システム時間を含めたCPU時間は約 48 秒だった． HashMapを用いてJavaで書いたプログラムで実行すると，そちらのCPU時間は約 35 秒だった．
そこで，scalaプログラムの getLines を以下のように変更した．

  def getLines(f: String) = new Iterator[String] {
    val reader = new java.io.BufferedReader(new java.io.FileReader(f))
    var line: String = reader.readLine
    def hasNext = line != null
    def next = { val nxt = line; line = reader.readLine; nxt }
  }

こちらのCPU時間は約 35 秒で，Javaとほぼ同じCPU時間だった．

リンク

ここまでのまとめ

プログラム作成

速度の計測

「Scalaを使ってみる」の目次