Foognostic blogs Seeking knowledge of foo

14Jun/09Off

Reduce your way into good Clojure

Standard disclaimer: Clojure is a new hobby.

In my previous post I discussed replacing for loops with the (map) function. That works, until it doesn't.

Another step needed to generate the cosine similarity for two strings is to create a frequency histogram, or how many times each character pair occurs in a string. This is a pretty good fit for a hash map, where the keys are the character pairs, and the values the occurrences.

Here was my initial try. This code meant to do well... but went far, far away from where I wanted. Let's take a detailed look at my failings:

user=>
(let [hm (hash-map)]
  (map
    (fn [key val]
      (assoc hm key val))
      ["a" "b" "a"]
      [1 2 3]))

What I wanted: a map like this => {"b" 2, "a" 3}. What I got was three hash maps, each with one key/value pair => ({"a" 1} {"b" 2} {"a" 3}). The intent was to use map to iterate over the sequences, and use assoc to put the key and value into one hash map. And now, for the parade of errors...

  1. Use the zipmap function to build hashmaps like this: (zipmap ["a" "b" "a"] [1 2 3]) => {"b" 2, "a" 3}
  2. Variables in the let block cannot be changed...
  3. ... but I really needed to update the hash map defined in the let block.

Rather than continue to stew uselessly I used Emacs to hop into the #clojure IRC channel. A wonderful person suggested the reduce function; I'd used it once in Ruby where it is best known as inject. I'm going to skimp on my description of reduce a little since that article is so well done.

reduce iterates over a collection like map, but it passes a mutable context to each callback. The return value of reduce is the final value of the context... essentially. I am still coming up to speed on it obviously.

Anyhow, enough yammering. Here is reduce in action:

(defn dot-product [l_histo r_histo]
  (reduce
   (fn [product key]
     (+ product
        (* (get l_histo key 0)
           (get r_histo key 0))))
   0
   (keys l_histo)))

Key points:

  • [l_histo r_histo] -- these are the arguments to the dot-product function.
  • (fn defines an anonymous function.
  • (get gets the value for the specified key from the specified map, returning the final argument when the key is absent.
  • 0 is the default value for product
  • (keys returns the keys of the specified hash map
  • Clojure really wins a lot by delegating to Java. This happened for free: (Math/sqrt 100) => 10.0

So, reduce is a critical step for people coming from imperative programming languages looking to do basic things with collections.

Filed under: clojure, code, lisp No Comments
14Jun/09Off

Put that for loop down

Standard disclaimer: Clojure is a new hobby.

I've been spending a lot of time with Clojure recently to prepare for the ICFP 2009 contest. It hasn't been the easiest thing, but the difficulty has come from trying to grasp many new concepts at once. The documentation has been pretty good, and the IRC channel has been pretty worthwhile.

I'm trying to settle on a specific task to accomplish when learning a new language. Something concrete but still a little academic... implementing cosine similarity using character bigrams isn't much code, but it covers enough bases to be useful.

Part of the process involves splitting a string into overlapping character pairs.

Bad old me instinctively thinks "for loop".

Good new me thinks... bad old me probably has it right. But it doesn't feel very Clojure-y. This solution is a little more Clojuriffic, probably still a long shot from idiomatic:

(defn split-bigrams [text]
  (map
    (fn [l r] (str l r))
    (seq text)
    (rest (seq text))))

Bad old me (BOM) would have iterated over the string using a for loop with an index variable, cycling length - 1 times.

Good new me (GNM) had to take a different approach. I converted the string into two sequences; the first was just the string, but the second was the string minus the first character.

sequence A : [A B C D E F]
sequence A': [B C D E F]

The (map) function takes a callback and a number of sequences. The callback function gets the nth element from each sequence, and the output from the callback is collected and returned by (map). I don't have to specify a length or repetition count, as (map) stops processing when any of the collections run out of data.

user=> (split-bigrams "ABCDEF")    
("AB" "BC" "CD" "DE" "EF")

This hasn't taken a lot of time, but more than I like. It has taken A LOT of patience, energy, and focus during that time. But I like finding new ways to solve problems. Now I have a new tool in my nerdbelt, one that pushes the for loop back a little bit.

Filed under: clojure, code, lisp No Comments
10Apr/09Off

Configuring projects in org mode, or defining variables in Lisp is strange

Org mode in GNU Emacs makes it somewhat easy to publish its files as a small static website. You define the files to publish, how to convert them, where to publish them, and so on. The bad news is that you need to manually write a big data structure in Emacs Lisp. Not what a Lisp newbie really wanted to hear (no customize page?) but I decided org mode was worth the pain. Speaking of pain, here's the data structure:

 0:  (set
 1:    (quote org-publish-project-alist)
 2:    '
 3:    (
 4:      (
 5:        "tsn"
 6:        :base-directory "~/Documents/foog/tsn"
 7:        :base-extension "org"
 8:         :publishing-directory "/dev/null"
 9:         :publishing-function org-publish-org-to-html
10:      )
11:    )
12:  )

All that work just to define a list of lists? I could just write it in Ruby as:

 # pretend this function exists in Ruby
 # def org_publish_to_html() end
org_publish_project_alist = [
    [ "tsn",
      :base_directory, "~/Documents/foog/tsn",
      :base_extension, "org",
      :publishing_directory, "~/Documents/foog/tsn_export",
      :publishing_function, org_publish_to_html ]
];

The major difference is all this

set
and
quote
nonsense. How many function calls does it take to initialize a variable in Lisp?? It turned out that answering that question well required some spelunking into Lisp. It kind of makes me appreciate GNU Emacs Lisp as a low-level high-level language.

And now, we plunge through the rabbit hole!

So, why call the

(quote)
function when defining a variable? I didn't realize what a good question this was. The answer means paying close attention to references and values. Let's take a quick look at a function which determines whether its argument is a value or a reference:

(symbolp 123)           => nil  ;; nil is false in Emacs Lisp
(symbolp (quote 123))   => nil  ;; 123 passes through...
(symbolp (quote "abc")) => nil  ;; abc passes through
(symbolp (quote foo))   => t    ;; but foo is a symbol.

Okay, so now we can distinguish between a value and a reference. The next step is assigning a value to a reference. This is where

(quote)
is essential.

(set foo 123) => Lisp error: (void-variable foo)

That silly Lisp interpreter, trying to get a value while we are in the middle of setting it. That's where

(quote)
comes in. The function call to quote is evaluated, and the result is passed as the first argument to
(set)
. That makes some sense, but because variable definition happens ALL THE TIME in imperative programming languages Emacs Lisp has tried to make this as easy as possible:

(set (quote foo) 123)  => 123  ;; No one does this
(set 'foo 456)         => 456  ;; 'foo = (quote foo)
(setq foo 789)         => 789  ;; People do this one.

INTERMISSION

Now that variable assignment has been hashed out we can pick up speed a bit and finish this off:

  • Line 1 specifies the symbol we want to use.
  • Line 2 is just a macro for
    (quote)
    , so the contents will be returned without being evaluated.
  • Line 3 starts a list. To me this implies that the variable
    org-publish-project-alist
    will contain an array of ... well, not sure yet.
  • Line 4 starts a list. This list will be the first item in the list defined by
    org-publish-project-alist
    .
  • Line 5 is just a string. It is the first element in the list. Calling
    (car)
    on the list started at line 4 would return "tsn".
  • Line 6 contains the 2nd and 3rd items in the list.
    :base-directory
    is a keyword symbol (
    (symbolp :base-directory) => t
    ). "~/Documents/foog/tsn" is the next element in the list.
  • Lines 7-9 define the remnant of the list started at line 4.
  • The remaining lines are the obligatory closing parens.

Probably because I am very new to Lisp I don't understand why flat lists are used to store this data. Key value pairs are a great fit for hash maps or associative lists (the variable name ends with alist...). My first attempt to restructure the data would be:

(setq org-projects
  '(
     ("tsn"
       (:base-directory . "~/Documents/foog/tsn")
       (:base-extension . "org")
       (:publishing-directory . "~/Documents/foog/tsn_export")
       (:publishing-function . org-publish-org-to-html))
     ("css" 
       (:base-directory . "~/Documents/foog/tsn/css")
       (:base-extension . "css"))
     ("img" 
       (:base-directory . "~/Documents/foog/tsn/images")
       (:base-extension . "jpg|png"))
  ))

The difference being that key value pairs are stored in an associative list. This lets you take advantage of built in functions for getting property values.

(mapcar
  (lambda (elt)
    (let
      ((project (car elt))
       (properties (cdr elt)))
      (print (format "Project %s has base extension %s"
                      project
                      (cdr (assoc :base-extension properties))))))
  org-projects)

Output: ("Project tsn has base extension org" "Project css has base extension css" "Project img has base extension jpg|png")

Wow, look at all the rambling. At least now I understand how to manually configure org mode's projects.

Lisp is neat.

Tagged as: Comments Off
   

Foognostic blogs is Digg proof thanks to caching by WP Super Cache