Author Archives: drpriceless

Clinical UI Design Sprint – Self Challenge

Starting Some Design Sprints

I have decided to start putting more “design sprints” into my schedule as a way of flexing my UI/UX design muscles in prep for future work.
I came from a UI design background, so I am getting back to some roots through this activity.

Like any good project, I have defined some constraints for a sprint. Here are my constraints:

  1. I have one week for a sprint with a topic of interest to me (does not have to relate to a project).
  2. I must complete one pomodoro daily (if you are not familiar with the technique, a pomodoro is essentially a 25 minute block of focused effort followed by a short rest).
  3. I must complete 10 pomodoros over a work week (Monday to Friday).
  4. I will document my work.
  5. I ship the design on the weekend.

I am focusing on healthcare IT design considerations, such as portions of the EMR and personal health services.

I will use methods that I can do myself, so I do not have to coordinate ethics, users, etc. This is meant to encourage practice with as few barriers as possible and, let’s be honest, waiting on recruitment for a focus group is not a pomodoro.

Methods I am using include:

  1. Simple web research
  2. Mood boards
  3. Personas
  4. Activity modeling
  5. Use Case Mapping
  6. Wireframing / Prototyping

I have picked these because they will fit into pomodoros and they have a visual component so they might be interesting to share online.

Anyone is welcome to join me on a sprint.

My first attempt was a sprint on the problem list in EMRs and I documented the work in a little video.

It was interesting and good to do. I will be making more space to do these in the coming months as I find them recharging.

On Symbolic Execution and more

Wait, what?

It is exactly what it sounds like.

Symbolic execution means assigning abstract symbolic values to variables in order to determine the flow behaviour of a program.

This concept was put forth in a paper titled “Symbolic Execution and Program Testing” by James C. King. Rather than using the data objects provided by the programming language for storing values, we use arbitrary symbolic values for computation. You can see that normal computation is a special case of symbolic execution in which we “concretize” the values. Let us take a look at some assumptions made by the author.

Assumptions

  1. Computer programs deal only with integers.
    This assumption is fairly valid and almost harmless in many analyses. Although a lot of programs work with float and double values, we will assume integral values only in order to reduce the search space. We will also ignore integer overflows (in the case of 32/64-bit integers in C, for example).

  2. The execution tree resulting from many programs is infinite.

  3. Symbolic execution of IF statements requires theorem proving, which is assumed to be mechanically possible.
    This is a very critical assumption and forms the basis of all analysis performed on the execution tree.

Why symbolic execution?

It is the “workhorse” of modern software testing. As with the topic of my previous blog post, symbolic execution saw a rise in interest and research due to increases in CPU speed and the ability to execute things in parallel. We will look at how exactly this is useful. Also, it is one of the very few ideas which has made it into industry and practice, breaking out of the academic bubble.

How does it compare to the current techniques of static and dynamic analyses?

Symbolic execution provides some advantages over both of these approaches. Let us look at them:

In comparison to (dynamic) testing, it allows us to explore a broader search space. We can make sure that the program is tested on almost all possible inputs before it is shipped.

Static analysis has seen a lot of growth in the past few years. We have better approaches for applying techniques to binaries as well as to source code. But one of the fundamental issues with such algorithms is that they can point out that there is a possibility of an error, yet they can never predict or output the input which triggers the exploit. Symbolic execution scores a +1 in that regard.

I think the above two reasons are enough to make us peek under the hood.

Control Flow graph and understanding the semantics of the flow

As Wikipedia defines it,

A CFG is a representation, using graph notation, of all the paths that might be traversed through a program during its execution.

A symbolic execution engine works on this exact representation of the program. Instead of variables having numeric (concrete) values, we choose one of the infinitely many abstract values and assign it to the variable. Let us check out an example:

#include <stdio.h>

int main() {
  int i;
  scanf("%d", &i);
  if (i == 5) {
    printf("Error.");
  } else {
    printf("Yay.");
  }
  return 0;
}

Now we can draw a CFG of the above program as follows:
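
A rough textual sketch of that CFG:

        scanf("%d", &i)
               |
          if (i == 5)
          /          \
       true          false
         |             |
  printf("Error.")   printf("Yay.")
          \          /
           return 0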

From the above graph we see that the IF statement causes the graph to branch in two. This means that, depending on the conditions imposed on the program’s variables in the IF statement, the program is going to behave differently. In this particular case, if the value of the variable i is 5, control flows to the true branch, and to the false branch otherwise. The rules for evaluating these statements must be extended to include symbolic values as well.

Understanding the generation of the path equation

A program state is usually defined by the values of its variables (global and local, of course) during normal execution. But King, in his paper, describes the generation of a path equation which has to be included in the program state. A path expression or path condition (pc) is always initialized to true. As the program follows a particular execution path, more and more conditions are conjoined to the pc. Each statement thus contributes to the path expression (unless it is a trivial assignment).

An interesting situation arises for the IF condition. A trivial case would be if the condition evaluates to one of the two boolean values. If it does, we continue with that branch and do not consider the other one. But if we cannot determine at symbolic-execution time what the condition will evaluate to, then we need to take a look at both branches. This happens due to the involvement of a symbolic variable in the equation.

We now fork the execution into two branches. One, where the true branch was taken, and another where it was not. This will basically lead us to generate two expressions. One will be of the form: pc AND (i==5) and another pc AND (i NOT= 5). Now symbolic execution for the rest of the program analysis continues with both these states in memory.
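
To make the forking concrete, here is a minimal sketch (an illustration of the idea, not code from King’s paper) of how an engine might represent and fork states, with the path condition kept as a list of constraints over the symbolic value alpha:

;; A state carries a path condition (pc), represented here as a vector of
;; constraints (an empty vector means "true").
(def initial-state {:pc []})

(defn fork-on-if
  "Given a state and a branch condition, return the two successor states:
   one where the condition is assumed true, one where it is assumed false."
  [state condition]
  [(update state :pc conj condition)
   (update state :pc conj (list 'not condition))])

;; Forking on (i == 5), with i bound to the symbolic value alpha:
(fork-on-if initial-state '(= alpha 5))
;; => [{:pc [(= alpha 5)]} {:pc [(not (= alpha 5))]}]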

Let us see the generation of the path expression of the above C program:

  1. Statements 1, 2: A fresh new symbolic variable is assigned to i. Let us call it alpha. Current pc: true
  2. Statement 3 : An if condition is encountered. Since it cannot be evaluated, we generate two conditions. pc: true AND (alpha == 5) | true AND (alpha NOT= 5).
  3. Statement 4 : Since this is the last statement in the program, we have already evaluated all the branches, and hence the expression here would be the logical OR of both the expressions in memory.

Thus, after evaluating the entire (execution) tree in this manner, we will end up with certain conditions on the variables at each leaf node. This is a completely exhaustive search mechanism, since we have assumed the values to be totally abstract and provided no information on what they could be (apart from being integers).

Below is the tree representation:
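
A rough textual sketch of the execution tree:

        [ pc: true ]   i = alpha
               |
      branch on (alpha == 5)
        /                \
  [ pc: alpha == 5 ]   [ pc: alpha NOT= 5 ]
    prints "Error."      prints "Yay."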

This is a very simple example but it gets the idea through. Although things get a lot messier with assignments and goto statements, the concept remains the same. The material above is enough to understand basic literature on the topic.

Theorem provers and SMT solvers

Well, now what? We have the path expression of the entire program. Assuming we want to generate test cases for the ERROR branch, we need to solve the boolean expression. Enter SMT solvers! SMT (Satisfiability Modulo Theories) solvers try to solve decision problems. Since, throughout the program, we are applying constraints to each variable, these are also known as constraint solvers. Microsoft’s Z3 is one of the most widely used (in production) theorem provers freely available online.

They take in an expression containing symbolic variables and the constraints imposed upon them. They then try to give us a result depending on the satisfiability of the problem. Sometimes a solver may give us a result in seconds, whereas at other times it may tell us that the problem is unsatisfiable. Sometimes, though, solvers fail to deduce either fact, since satisfiability is an NP-complete problem. The algorithms these solvers use internally are a matter for a completely different blog post altogether, but they basically try to combine the expressions from the given path condition and see if they can find a valid solution. Since the search space is very large (unless otherwise bound by a constraint), this can take a lot of time. Performance also depends, to a certain extent, on the number of conditions, and it decreases as the number of expressions grows.
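
As a small taste of what this looks like, here is a hedged sketch (names are illustrative) using clojure.core.logic’s finite-domain solver rather than Z3, since that is the library used in the demo below. We ask for a concrete input satisfying each branch’s path condition from the earlier example:

;; Sketch only: find concrete values for the symbolic variable alpha that
;; satisfy each branch's path condition. fd/in bounds the search space.
(ns sketch.pc
  (:refer-clojure :exclude [==])
  (:require [clojure.core.logic :refer :all]
            [clojure.core.logic.fd :as fd]))

;; ERROR branch: pc is (alpha == 5)
(run 1 [alpha]
  (fd/in alpha (fd/interval 0 100))
  (fd/== alpha 5))
;; => (5)

;; Other branch: pc is (alpha NOT= 5)
(run 1 [alpha]
  (fd/in alpha (fd/interval 0 100))
  (fd/!= alpha 5))
;; => a concrete value other than 5, e.g. (0)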

Are we going to have a demo or what?

In Clojure? Sure! We are going to try to solve Project Euler problem 9. The problem statement is as follows:

A Pythagorean triplet is a set of three natural numbers, a < b < c, for which,
a^2 + b^2 = c^2
For example, 3^2 + 4^2 = 9 + 16 = 25 = 5^2.
There exists exactly one Pythagorean triplet for which a + b + c = 1000. Find the product abc.

It is pretty easy to just iterate over all the possible values up to 1000, but we will try and do something new! \o/

Okay, now let’s say that a, b and c are symbolic variables. We will use the clojure.core.logic library for this.

(ns clj.p9
  ;; core.logic's == conflicts with clojure.core/==
  (:refer-clojure :exclude [==])
  (:require [clojure.core.logic :refer :all]
            [clojure.core.logic.fd :as fd]))

(defn solution
  ;; We take as input the sum of the three Pythagorean variables and return the product as the result.
  [sum]
  ;; We will have the result in a vec as [a b c].
  ;; Since we want the product a*b*c, let's just multiply them together.
  ;; We use the function first since the two results provided are:
  ;; [a b c] and [b a c] with c being the hypotenuse. Hence we choose one of them.
  (reduce * (first
  ;; Here we go! `run*` implies that we want the computer to compute
  ;; all the possible values for the expression `q`.
  (run* [q]
  ;; Declare 3 fresh symbolic variables for a, b and c.
  (fresh [a b c]
  ;; The solution should be a vec of values of the variables a, b and c.
  (== q [a b c])
  ;; Now we know that all the numbers have to be distinct.
  ;; We hence use the fd/distinct to add that constraint.
  (fd/distinct [a b c])
  ;; Another constraint we can derive is that, since the sum of any two sides is always greater than the third,
  ;; each side has to be less than half of the `sum` value.
  ;; This can reduce our search space by quite a lot. fd/in provides us an option to specify the range.
  ;; (fd/interval [start] [end]) is the actual numeric range.
  (fd/in a b c (fd/interval 1 (/ sum 2)))
  ;; We now add the given constraints on the variables.
  (fd/eq
    ;; sum = a + b + c
    (= sum (+ a b c))
    ;; c^2 = a^2 + b^2 (since they are pythagorean triplets)
    (= (* c c) (+ (* a a) (* b b)))))))))

EZ PZ. Let’s run this!

(time (solution 1000))
"Elapsed time: 2345.989466 msecs"
31875000

There we have it! Solved in 2.3 seconds. Although this was fairly simple, we learnt a lot along the way. core.logic is not as highly optimized as Z3, but it is fairly fast; it is built for relational (logic) programming, which subsumes this kind of symbolic analysis. Z3 also has a nice Python API to interact with it!

Issues involved

Current research trends include the analysis of both source code and raw binaries. Program testing can be approached in an entirely new way if the inputs to methods are considered to be symbolic and we then fuzz test over the entire search space. It is, however, very important to realize that there are a lot of problems which can come up in practical implementations:

  1. Path explosion: If there are a lot of branching statements in the program, a new path expression is generated at each branch, creating a new instance of the symbolic runtime and increasing the number of paths. This is a very serious problem and requires smart input from the user to prune certain branches off.

  2. Environment interactions and system calls: A symbolic engine will find it very tough to interact with the underlying OS or the runtime environment, which can hinder a lot of core systems testing. The program may make decisions based on the return value of a particular read call, or write to a particular file and then branch on the outcome. There is ongoing research into implementing a virtual model of the underlying stack for the engine to interact with.

A thought experiment: Symbolic Execution in Healthcare?

Medical systems need to be completely foolproof and verified before they are sent to production. Symbolic execution and analysis can be absolutely critical in such verification. As with normal programs, even though certain inputs are completely pointless (for example, providing a negative number to a square-root calculator), it is important that we implement correct error handling.

Although the above is practical, it would be more interesting to apply this logic to the human body directly, which is… let’s just say it’s a thought experiment. Some of my thoughts related to the ideas below stem from the discussion here. Below are some assumptions I make:

  1. Since the human body is made up of a finite number of atoms, at any point in time there is a set of values which describes the entire body state (assuming we manage to deconstruct the brain).

  2. Although huge, we have a model of the human body which can be provided to a computer to generate an approximate system which will emulate the body internals and reactions to stimuli. Although far-fetched, we are slowly inching toward such a goal.

  3. The computer we imagine is fast enough to perform calculations on such a scale.

Imagine medicine in such a scenario. Assume a medicinal drug to be a symbolic variable and the program to be the human body itself. The medicine can be thought of as an input to the body, and hence the system should be able to track the breakdown of that medicine in the body. The IF conditions in this case would represent existing patient conditions and body state, which would lead to different reactions in the body. Once we have all the leaf nodes of a graph of such a system, the resulting information could be highly informative for providing crucial healthcare and designing drugs, since we would have data regarding each breakdown of the chemicals. Drug design could benefit greatly from such a scenario!

The possibilities are endless. For now, let me get back to ValidIS.

References!

[1] Symbolic Execution and Program Testing, JC King.
[2] Lecture notes on Symbolic Execution, University of Maryland.

Solving the trivial problem of dimensional analysis in Clojure

Functional Programming

Functional Programming (or fp as it is often called) is a paradigm of programming which aims at describing what we want to do rather than how we want to do it. The entire universe of fp revolves around the concept of pure functions.

A pure function is a function where the return value is only determined by its input values, without observable side effects.

Imagine, if you will, a conveyor belt which takes in raw material as input. There are a lot of tools which process the input as it moves along, eventually producing the desired result. You can consider functional programming to be similar to this situation in a number of ways. The conveyor belt can be thought of as the program, with each of the tools being a pure function (no external side effects, depending only on its input). The only difference is that on the belt the original resource is modified in place, whereas in the program the data is worked upon and a new instance is passed on to the next function.

Although fp has been around for a while, due to the advent of multi-core systems, big data and distributed computing it is seeing a resurgence in terms of interest in the programming community. It is backed by very strong literature and there is a lot of ongoing research on the same.

Some of the principles that fp operates by are as follows:

  1. Immutable data structures: How do we program with such an apparent impairment? Rather than modifying data in place, we operate on it and create a new instance. Persistent data structures with structural sharing keep this efficient in practice, and immutability allows the compiler and runtime to make assumptions that are unsafe in an imperative language.

  2. Higher-order functions: These are functions which take other functions as arguments or return them as results. The idea is derived from the lambda calculus. One of the best examples of a HOF is map, which takes a function as its first argument and applies it to each element of the collection provided as the second argument. Since each function application is independent of the others (assuming pure functions), there is a lot of scope for running the same work across a multi-core system; the results can then be aggregated using a function such as reduce (see the sketch after this list).

  3. Separation of functions and values: To an OO person, the concept of encapsulation is pivotal while programming. It allows a class to have internal state which can be modified by calling class methods that operate on those values. In fp, these two entities are clearly separated.

  4. Statelessness: Since all the functions are pure, they have no idea of what has happened or what is going to happen to the input they receive. They only care about operating on the data and producing the output. Combined with immutability, this is extremely useful for unit testing, since each function has a specified set of duties to perform and, in doing so, will not affect any external state or variables. The state of the program is kept local, and functions are completely pure in that they do not hide any state behind an abstraction.
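
Here is a minimal sketch of the map/reduce combination described in point 2 (the data and functions are just illustrative):

(def readings [1 2 3 4 5])

;; Square each reading with map, then aggregate the squares with reduce.
(reduce + (map #(* % %) readings))
;; => 55

;; pmap has the same contract as map but evaluates elements in parallel,
;; which is safe here because the squaring function is pure.
(reduce + (pmap #(* % %) readings))
;; => 55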

Clojure

Clojure is a dynamic, general purpose programming language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming.

Rich Hickey is a genius. Clojure sits on top of the JVM, leverages Java’s type system and supports calling Java code from within. Clojure has a set of built-in data structures such as maps, sets, vectors and lists. It supports lazy sequences, immutability and persistent data structures, and it boasts a typical Lisp syntax (which can be a little intimidating for the novice programmer).

Why am I doing this?

I was inspired by the solution to the problem as given by Peter Henderson in his book on Functional Programming and its applications. It is described in a very general manner and I thought it would be interesting to implement it in Clojure using a few of my own variations.

Why Clojure?

It’s awesome. It’s functional and thinking functionally gives you a way to come up with interesting solutions to common problems encountered during programming.

Cutting to the point

Dimensional Analysis is a very simple method to check if a physical equation is valid or not.

The dimensions of a quantity refer to the type of units that must be used in order to obtain the measure of that quantity. The types of units used are the 7 fundamental (base) units, which cannot be described in terms of anything else. We will be looking mainly at equations involving quantities made up of Mass (M), Length (L) and Time (T).

We will assume that the equation is linear in all the physical variables and that the operators used are multiplication, division, addition and subtraction. Although this makes things a lot simpler than they really are, it will help us set up a basic framework for solving such a problem.
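
As a quick worked example (a standard textbook one): in v = u + at, both v and u have dimension L T^-1, and the product at has dimension (L T^-2)(T) = L T^-1. Both sides of the addition match, so the equation is dimensionally correct and the resulting quantity has dimension L T^-1.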

Understanding the problem

We will be trying to solve the following problem:

Assume that we have a set of variable quantities v1, v2, etc. each with a unique single alphabet symbol, and a given set of dimensions a1, a2, etc.

We will be trying to determine two things:
1. If the equation is dimensionally correct.
2. The dimension of the resulting quantity.

Let us assume a very straightforward model of an equation.

  1. We will consider that the equation is made up of Expression(s).
  2. Each Expression has a Left Hand Side(LHS) and a Right Hand Side(RHS) with an Operator acting upon them.
  3. The most fundamental quantity in the equation will be an Atom. An Atom can be a constant or a variable.
  4. Each quantity has a Dimension where we mention the exponents to the base quantities.

We will be using schema for data descriptions. schema provides us with functions to validate data and find out whether it conforms to the given schema descriptions.

;; schema (prismatic/schema) is required as 's', i.e. (:require [schema.core :as s]).

;; Dimension will contain the exponents to M(Mass), L(Length) and T(Time).
;; With s/Int, we specify that the values have to be integral.
(def Dimension
  {:M s/Int
   :L s/Int
   :T s/Int
   })

;; An Atom will contain a type specifying if it is a variable or a constant.
;; Let us also have the symbol of the variable, in the data definition.
;; Most importantly, the dimension of the quantity has to be mentioned as well.
;; As you will observe, the dimension is described in terms of the earlier schema.
;; This will help in Atom validation.
(def Atom
  {:Type      (s/enum :variable :constant)
   :Symbol    s/Str
   :Dimension Dimension})

;; Let us also define the dimension for a constant quantity.
;; Since it is a constant it will not have exponents in any of the quantities.
(def Constant
  {:M 0
   :L 0
   :T 0})

;; Since we are assuming a simple case, let us say that the operations can be addition,
;; subtraction, division and multiplication.
;; The model is so designed that it can be easily extended to other operations.
(def Operator
  (s/enum :add :subtract :divide :multiply))

;; We will now describe the definition for Expression.
;; An expression can be described in a recursive manner.
;; The LHS and the RHS both can be Expressions or Atoms. schema provides a nice way to do this.
(def Expression
  {:LHS (s/conditional #(= nil (:Type %)) (s/recursive #'Expression) :else Atom)
   :RHS (s/conditional #(= nil (:Type %)) (s/recursive #'Expression) :else Atom)
   :Operator Operator})

;; We first check if the data in the LHS quantity has a Type field.
;; If it does, we know that it is an Atom, and we direct the checker to validate it for us.
;; Instead, if it does not, we can check for an Expression instead.
;; This can definitely be optimized but I thought this was one way to do it.

Coming up with functions for dimension calculations

Now, that we have the basic definitions sorted out, we can look forward to implementing some functions for checking dimensional correctness.

Assume a function which takes in an Expression or an Atom. We need to find out if the given expression is dimensionally correct. Let us divide the possible scenarios into cases such as:

  1. Input is an Atom. We can then return the dimension of the quantity.
  2. Input is a constant. We can then return Constant which is basically a dimensionless quantity.
  3. Input is an Expression with Operator :add. In this case, we need to check if the LHS and the RHS have the same dimensions. If they do, we can return the dimension of either. If they do not, we return an undefined value (say UNDEF).
  4. Similar is the case for subtraction as well.
  5. Input is an Expression with Operator :multiply. We don’t need to check anything for this, but we return a sum of the dimensions of the two quantities involved.
  6. Input is an Expression with Operator :divide. Again, we return the difference between the dimensions of the numerator (LHS) and the denominator (RHS).

;; A generic undefined value. We can use this to convey that the equation is dimensionally incorrect.
(def UNDEF
  "INCORRECT")

;; Used to calculate the dimensions of the resulting quantity after the product of the two inputs.
;; If either of the quantities is UNDEF, this function returns UNDEF.
(defn product
  [dim1 dim2]
  (cond
    (= UNDEF dim1) UNDEF
    (= UNDEF dim2) UNDEF
    :else {:M (+ (:M dim1) (:M dim2))
           :L (+ (:L dim1) (:L dim2))
           :T (+ (:T dim1) (:T dim2))}))

;; Used to calculate the dimensions of the resulting quantity after the division of the two inputs.
;; If either of the quantities is UNDEF, this function returns UNDEF.
(defn ratio
  [dim1 dim2]
  (cond
    (= UNDEF dim1) UNDEF
    (= UNDEF dim2) UNDEF
    :else {:M (- (:M dim1) (:M dim2))
           :L (- (:L dim1) (:L dim2))
           :T (- (:T dim1) (:T dim2))}))

;; Main logic function. Takes in an expression as an argument.
;; This function gives as output the dimension of the resulting quantity,
;; based on the Expression map provided to it.
;; If the equation is dimensionally incorrect, it will return UNDEF.
(defn dim
  [expr]
  (let [lhs           (:LHS expr)
        rhs           (:RHS expr)
        operator      (:Operator expr)
        expr-type     (:Type expr)]
    (cond
      (= expr-type :variable)  (:Dimension expr)
      (= expr-type :constant)  Constant
      (= operator :add)        (if (= (dim lhs) (dim rhs))
                                 (dim lhs)
                                 UNDEF)
      (= operator :subtract)   (if (= (dim lhs) (dim rhs))
                                 (dim lhs)
                                 UNDEF)
      (= operator :multiply)   (product (dim lhs) (dim rhs))
      (= operator :divide)     (ratio   (dim lhs) (dim rhs)))))
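
A quick sanity check of dim (the atoms here are made up for illustration): dividing a length by a time should give the dimension of velocity.

(dim {:Operator :divide
      :LHS {:Type :variable :Symbol "d" :Dimension {:M 0 :L 1 :T 0}}
      :RHS {:Type :variable :Symbol "t" :Dimension {:M 0 :L 0 :T 1}}})
;; => {:M 0, :L 1, :T -1}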

Setting up a grammar for parsing the equation

Now that we have the general setup ready, we can go ahead and worry about actually taking user input and parsing it into the given Expression format. For this, we will be using the instaparse library. It allows you to specify your own grammar and builds a parse tree from it.

Let us consider the simplest linear equation we will be looking to solve:

v = u + at

The first equation of motion. Very simple to understand, but parsing it might be a pain. To make things slightly tougher, let’s say that the quantities in the equation can have integral coefficients as well. So it becomes:

v = Xu + Yat where X,Y are integers

We can basically take a grammar used for arithmetic and make a few changes to it to include the physics terms, or, as I will refer to them, pterms.

;; Introducing pterm as a new grammar rule.
;; pterm is basically a coefficient and multiple symbols multiplied together.
;; The output format specified here is :enlive which will help us in formatting the parsed data later.
;; (The pterm, coefficient and sym rules are sketched here to match the transform functions below.)
(def physics
  (insta/parser
    "expr        = add-sub
     <add-sub>   = mul-div | add | sub
     add         = add-sub <'+'> mul-div
     sub         = add-sub <'-'> mul-div
     <mul-div>   = pterm | term | mul | div
     mul         = mul-div <'*'> term
     div         = mul-div <'/'> term
     pterm       = coefficient sym
     coefficient = #'[0-9]+'
     sym         = #'[a-zA-Z]'+
     term        = number | <'('> add-sub <')'>
     number      = #'[0-9]+'"
    :output-format :enlive))

Parsing the tree!

Alright! Let us dive directly into the code:

;; instaparse provides us with an awesome function known as transform.
;; transform allows us to traverse the tree bottom-up,
;; act upon each node's data and replace the node by operating upon it.
;; Since :expr will be the root node, it will eventually contain the entire expression
;; map, hence we provide it the identity function.
(def parse-tree->schema-def
  {:sym         replace-sym
   :coefficient replace-coefficient
   :add         replace-add
   :sub         replace-sub
   :mul         replace-mul
   :div         replace-div
   :term        replace-term
   :pterm       replace-pterm
   :expr        identity})

;; Each of the replace-"operator" functions can be defined as follows
;; (considering replace-add, for example):
(defn replace-add
  [term1 term2]
  {:Operator :add
   :LHS      term1
   :RHS      term2})

;; This allows us to keep the function logic simple.
;; The parser will give us the :add node with two children,
;; each representing a side of the operation.
;; As mentioned before, this is true for all of the operator functions.

Now we have the basic setup for conversion ready, but we are still missing a few important functions:

  1. replace-coefficient: Since a coefficient is going to be a constant, we can just create an Atom map for the constant and return it. The Symbol for it can be the value of the coefficient.

  2. replace-term: As we can see from the parsing rules above, term will always contain an integral value, and hence it can be treated as a dimensionless constant as well (a sketch of both helpers follows below).
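
Here is a hedged sketch of those two helpers; their exact bodies are not shown here, so the shapes below are assumptions based on the Atom schema and the example output at the end. For the dimensional analysis itself, only the :Type and :Dimension fields actually matter.

(defn replace-coefficient
  ;; The grammar hands us the digit string of the coefficient; wrap it as a
  ;; dimensionless constant Atom, using the numeric value as its Symbol.
  [digits]
  {:Type      :constant
   :Symbol    (Integer/parseInt digits)
   :Dimension Constant})

(defn replace-term
  ;; Plain numeric terms are dimensionless as well; we keep the raw parsed
  ;; value as the Symbol.
  [number]
  {:Type      :constant
   :Symbol    number
   :Dimension Constant})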

Now, the two most important functions left are replace-pterm and replace-sym. Let us see how we can tackle them:

;; The pterm is basically a multiplication of the co-efficient term and the symbol terms.
;; Assuming we have both of them defined correctly:
(defn replace-pterm
  [coeff-term symbol-term]
  {:Operator :multiply
   :LHS coeff-term
   :RHS symbol-term})

;; Now, this is a bit more involved: we can have multiple symbols being multiplied together.
;; Our expression model is pretty simple in that we have an LHS and an RHS, both of which
;; can either be an Atom or an Expression.
;; Suppose we have the term "w*h*a*t"; we can consider it as
;; w * (h * (a * t))
;; This exact algorithm is implemented below.
;; It takes the first element of the symbol list and recurses on the rest.
;; If we have a single element, we call find-sym, which we will describe later,
;; but if we have 2 elements, then we know that we have reached the end of the list and
;; can create a multiplication Expression.
(defn replace-sym
  [& sym]
  (case (count sym)
    1 (find-sym (first sym))
    2 {:Operator :multiply
       :LHS      (find-sym (first sym))
       :RHS      (find-sym (last sym))}
    ;; default case: recurse on the remaining symbols
    {:Operator :multiply
     :LHS      (find-sym (first sym))
     :RHS      (apply replace-sym (rest sym))}))

All well and good. The only remaining piece of the puzzle is find-sym. This is basically a lookup into an Assoc-list. Henderson defines a map/list as an Assoc-list if we can look up the symbol dimensions in it. Easy.

;; Defines the variable 'v' as an Atom (u, a and t are defined in the same way).
(def Assoc-List
  {:v {:Type :variable
       :Symbol "v"
       :Dimension {:M 0
                   :L 1
                   :T -1}}})

;; Simple lookup into the Assoc-List map.
;; Returns UNDEF if not found.
;; We can have the user populate this map as an earlier exercise.
(defn find-sym
  [sym]
  (let [sym-key   (keyword sym)
        sym-value (sym-key Assoc-List)]
    (if (not-empty sym-value)
      sym-value
      UNDEF)))

Yay!

Lo and behold! We now have a working program which can dimensionally analyze equations for us. We can then define a function to aggregate all of the above into a single API exposing the required functionality. Maybe for a later time ;). Let us see it running:

user=> (def equation (physics "2u+3at"))
#'user/equation
user=> (insta/transform parse-tree->schema-def equation)
{:Operator :add, :LHS {:Operator :multiply, :LHS {:Type :constant, :Symbol 2, :Dimension {:M 0, :L 0, :T 0}}, :RHS {:Type :variable, :Symbol "u", :Dimension {:M 0, :L 1, :T -1}}}, :RHS {:Operator :multiply, :LHS {:Type :constant, :Symbol 3, :Dimension {:M 0, :L 0, :T 0}}, :RHS {:Operator :multiply, :LHS {:Type :variable, :Symbol "a", :Dimension {:M 0, :L 1, :T -2}}, :RHS {:Type :variable, :Symbol "t", :Dimension {:M 0, :L 0, :T 1}}}}}
user=> (def parsed-result (insta/transform parse-tree->schema-def equation))
#'user/parsed-result
user=> (s/validate Expression parsed-result)
;; returns the validated map on success
user=> (dim parsed-result)
{:M 0, :L 1, :T -1}

P.S: I am a novice Clojure programmer. Do suggest improvements in my coding style and the above code itself. It is hosted on my Github.

References:
Peter Henderson’s book on Functional Programming Application and Implementation

Software Engineering in Health Systems Workshop 2016 – Where Engineering is Not Enough

This weekend I had the opportunity to attend the Software Engineering in Health Systems (SEHS) Workshop in Austin, Texas. Through a series of presentations, panels, and group discussions on a range of health care systems topics, two main themes seemed to arise:

  1. Health care is an information intensive domain in which contemporary software engineering techniques are ill-equipped to operate. Systems engineering techniques seem to handle the problems more easily (though significant work is required).

  2. Though health care professionals and engineers seem to collaborate in these environments, they each fail to fully grasp the perspective of the other; this results in “half-way” solutions which address an immediate need but are brittle and ultimately result in system instability.

The good news, in my opinion, is that the solutions to these problems do not require tremendous innovation in science or engineering. Instead, they require the correct allocation of human resources to the right problems.

In fact, allocation of resources (including humans) is one of the three main views (the others are static and dynamic structures) from which you can consider the architecture of a software system; these problems are architectural in nature.

I think the solution to (1) is to bring software engineers to the tables at which systems decisions are made. Kevin Sullivan made this point beautifully in one of the workshop discussions. He indicated that software engineers are already trained to think about systems issues; if they can be part of the discussion about the socio-technical systems in health care, then we will begin to see progress.

(2) is a matter of education; interestingly though, I think the problem cannot be addressed by another classroom course in a standard undergraduate education. In my experience, understanding the complexity of the operations carried out in both health care and software engineering requires a more tacit type of knowledge one can only acquire by actively being part of the environment. Joachim Reiss, VP of R&D at Siemens Health Care, made this point exquisitely clear in his keynote presentation: we need to get engineers into hospitals and health care providers into design labs.

Of course, neither of these problems is trivial; they involve complex social and political structures that make change, especially social change, tremendously difficult. This is simultaneously a great and maddening realization. It is nice to know that we are very capable of effecting massive change in health care systems through social change; however, it is clear from history (and recent events) that social change can be extremely difficult to enact on a large scale. Sadly, as much as I wish we could, we cannot rely strictly on engineering to solve this problem for us.

Finding Twitter Conversations

A Twitter conversation is a series of tweets that usually revolve around an initial tweet. The initial tweet is defined as the root of the conversation, and the rest of the tweets in the conversation “stem” or “branch” from the root. Finding the root is relatively simple. For example, to find the root of a conversation (or conversations) about “warfarin”, a keyword search is performed. Any tweet containing the word “warfarin” would be considered a conversation root. The trick is how to define and find “branching”. Branches are tweets that form a coherent conversation when attached to the root.

Natural Language Cues

One technique is to look for linguistic and textual cues that indicate possible branches. Possible initial branches are replies to the root. Replies to the root tweet cannot be found directly; they can only be found through replies to the user that tweeted the root (the root user), via the @ symbol (or other variations). So a @root_user string would indicate a reply to the root user, which could be a reply to the root tweet or a reply to another one of the root user’s tweets. Also, any @ in the root tweet indicates that the root user is replying to another user with the root tweet. These users being replied to could also potentially write tweets related to the root tweet (reply-to tweets). Figure 1 below illustrates the different components of a Twitter conversation.

Figure 1: Example Twitter Conversation

To find whether a reply or a reply-to tweet is a branch of the root tweet (and therefore part of the conversation), further natural language analysis is required. Such tweets must form a coherent conversation with each other and with the root. The key to finding coherent conversations is to find possibly related terms that would act as links between the tweets. For example, a drug and an adverse drug event (ADE) could be related terms. Using such links as a starting point, dependency parsing algorithms would do the heavy lifting of connecting individual tweets into a conversation. The shortest dependency path measure [Liu and Chen 2013] (see Figure 2 for an example) could be used to find terms with a high probability of being related.

Figure 2: Example of a Shortest Dependency Path for Part of the Last Tweet in Figure 1

Tweet Networks

Reply and reply-to tweet networks extend beyond the root tweet, as each reply and reply-to tweet can have its own replies and reply-to tweets. Reply and reply-to tweets can form interesting networks. Well-connected or central users in such a network can be very important parts of a conversation and deserve special attention. These networks are directed graphs where each node represents a user and each edge is a tweet pointing from the sending user towards the receiving user. There are two main measures of centrality for each user in such a directed network. Closeness centrality measures the average shortest path between a user and all the other users in the network. Betweenness centrality measures the number of shortest paths on which a user lies between two other users; it is usually expressed as the probability that a certain user will be on the shortest path between any two users in the network. Both measures point out users that may be central and important parts of a conversation. The tweets of these users could form the main parts of a conversation. Figure 3 illustrates one example of these networks.

Figure 3: Twitter Users’ Tweet Directed Graph Network
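
As a rough illustration of these measures (the users and edges here are made up), a reply network can be modelled as an adjacency map, with closeness centrality computed from BFS shortest paths:

(def reply-graph
  {:alice #{:bob :carol}
   :bob   #{:alice}
   :carol #{:alice :dave}
   :dave  #{}})

(defn shortest-paths
  "BFS distances from start to every reachable user."
  [graph start]
  (loop [frontier [start] dist {start 0}]
    (if (empty? frontier)
      dist
      (let [nbrs  (for [u frontier
                        v (graph u)
                        :when (not (contains? dist v))]
                    [v (inc (dist u))])
            dist' (into dist nbrs)]
        (recur (map first nbrs) dist')))))

(defn closeness
  "Inverse of the average shortest-path distance from user to the others."
  [graph user]
  (let [ds (vals (dissoc (shortest-paths graph user) user))]
    (if (seq ds)
      (/ (count ds) (reduce + ds))
      0)))

;; (closeness reply-graph :alice) => 3/4 (bob and carol at distance 1, dave at 2)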

References

Liu, Xiao; Chen, Hsinchun; AZDrugMiner: An Information Extraction System for Mining Patient-Reported Adverse Drug Events in Online Patient Forums, ICSH 2013, LNCS 8040, pp. 134-150, (2013)

Debriefing on the Apple-FBI Debacle: The Aftermath

The Event

As many of you may have heard, in February the FBI requested that Apple make a modification to their systems to allow them to have access to an encrypted iPhone — a request which swiftly invoked the ire of the security community. Many experts asked why the FBI would even ask such a “ludicrous” and “short-sighted” question. They questioned the FBI’s understanding of basic encryption principles and quipped that the decision must have been made by a politician, since no security expert would have made such a request. They further pointed to the past revelations from Snowden’s leaks and how many government agencies have recently abused (and continue to abuse) the powers they have been given in this area.

Many worried that such a request would set a precedent, and even the FBI director admitted that it most likely would.

Apple responded in kind and denied the request. This signaled the start of significant political posturing by both players to garner support for their cause. The security community and many industry leaders quickly sided with Apple.

Ultimately the FBI elected to contract a third party who used an unknown exploit to gain access to the device. Both parties ceased their posturing and stood down.

The Aftermath

Three months later we observe the resulting fallout of this debacle.

Before this event even happened, trust in the U.S. government was near an all-time low, and this event did little to instill further confidence.

Beyond this, some have claimed that Apple “won” the battle as they did not provide the requested modifications — the FBI found another way in that did not require a court decision. This may be premature.

What Apple did do was prevent (for the time being) the FBI from having easy, systemic access to any iPhone they want. This is important since the previous abuses of power stemmed from easy and systemic access to data (with or without warrants).

Prevention of such access requires the FBI to treat each case as its own separate issue:

  • Get the physical device
  • Get a warrant to gain access
  • Breach that specific device
  • Search for the evidence they need

The difference is in the difficulty and the case-by-case nature. It is not easy to do this process for 100,000 devices at one time. And it should not be. That is beyond the scope of a single case, and some would say it borders on surveillance.

As with most things, there is no black-and-white. There is always a trade-off between privacy and security, and it is not always clear which path should be taken.

What must be considered is that each breach of privacy for sake of security is a one-way transition. By breaching privacy through court-order, or through legislation, a precedent is set. This makes subsequent and wider-scope breaches easier. Alas there is an opportunity cost to our drive for total security. By resisting the FBI, Apple has resisted the erosion of privacy on a systemic level. Access to the device was still gained, but not through systemic means.

Furthermore, this event has heightened the awareness of privacy in the general populace. This can only be taken as a good sign. One should take their privacy seriously and be critical of any institution that intends to infringe on it. Not only did individuals take measures to secure their privacy, but other institutions did too. WhatsApp decided to strengthen their encryption as a direct result of the Apple case.

Unfortunately, the lack of precedent also means that this particular conflict may arise again in the future. And that future may include some event that spurs the government to forgo long-term thinking and rationality for a short-term solution to their immediate problems… Because, that hasn’t happened before.

All in all, the result is favourable to both parties. Apple is enshrined as an advocate for privacy and maintains its position without alienating its consumers, while the FBI was ultimately able to gain access to the device in question.

As for the future? No-one knows.

Requirements Documentation with Gherkin

Blog: Requirements and Gherkin

We have been kicking off our ePRO study (more on this study later), and there is a range of partners in the study. We needed to quickly progress to a level of detail in the requirements that ensures common ground, even before we get IP and ethics agreements in place.

For this work, I have chosen to use Gherkin, trying to drive some behaviour driven development (BDD). We are not expecting to use Gherkin throughout the development — each of the partners has their own methods and tooling, and Gherkin / Cucumber is not among them. Gherkin does, however, provide a clear and accessible convention for structuring requirements as features and scenarios.

Further, as domain specific languages are becoming a popular topic in our lab, it seems fitting to use another one for requirements.

Gherkin Materials

For those not familiar with Gherkin, several people have written about it better than I could here. For example:

  • The source on GitHub: https://github.com/cucumber/cucumber/wiki/Gherkin
  • Cucumber.io: https://cucumber.io/docs/reference
  • Behat has a great description with solid examples: http://docs.behat.org/en/v2.5/guides/1.gherkin.html
  • And there is the cucumber book: https://media.pragprog.com/titles/hwcuc/gherkin.pdf

Gherkin and ePRO

We have completed several drafts of the requirements, iterating with our partners. The requirements have grown from a half dozen high-level features to 20 pages of more detailed features and scenarios. We have found it useful to organize the features around several actors and the partner / data streams. This works well, given how the project is structured as well as how the ePRO intervention was conceptualized. These requirements now include details such as example data to help the team members. Gherkin’s structure was well suited for this kind of work, and the organization by actor was definitely useful.
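
For illustration, a feature written in this convention might look like the sketch below (a made-up example, not one of the actual ePRO requirements):

Feature: Patient records a daily symptom report
  As a patient enrolled in the study
  I want to record my symptoms each day
  So that my care team can monitor my progress between visits

  Scenario: Patient submits a completed symptom report
    Given I am logged in as a patient
    And today's symptom report has not yet been submitted
    When I rate my pain as 3 out of 10 and submit the report
    Then the report is saved with today's date
    And my care team can see the new report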

Gherkin: Business Readable? Business Writable?

I appreciated a post from Martin Fowler reflecting on whether domain specific languages should be business readable or business writable. In our project, so far, Gherkin has been used in a business readable manner. That is, I have been the primary author, and I invited comments from the team. Business owners are providing feedback, questions, and changes, and noting areas where there is need for more detail. My role in our team is often a bridging role – I span clinical and information science – and that seems to be a good fit for using a language like Gherkin.

Next Steps

This work will feed into some user centred design and testing cycles for ePRO that I am very much looking forward to starting.

Thoughts on IBM Watson

I recently attended a presentation by John Boyer, a Chief Architect on IBM’s Watson project. The presentation was about the Watson technology that IBM has developed and continues to develop. This was the second Watson presentation I have attended; the first was at CASCON 2014, where an IBM VP gave a keynote lecture. I must admit, after both presentations I walked away feeling rather disappointed.

I must preface the remainder of this post by stating that this represents my opinion based on the information I have gathered through these presentations and reading the news. I could be entirely wrong.

The feeling of disappointment stems from what I thought Watson was. Before attending these presentations, I had watched Watson play Jeopardy and read multiple news stories about this fantastic new technology; it felt like Watson was a huge step forward, that it was going to meet all of our analytics needs for all of eternity and make me coffee every morning. I expected these presentations to dazzle and amaze me with new and exciting machine learning techniques or to provide some deep philosophical insight regarding AI. After all, IBM must have some “secret sauce” that makes this tech work, right?

In retrospect, it is not surprising that I was disappointed; I set myself up for it. Anytime someone attributes a “god-like” status to anything, technology or otherwise, they will inevitably be disappointed in some respect.

As far as I can tell, Watson is an aggregation of the numerous machine learning and data mining techniques developed over the last 50 or so years. Of course, IBM has improved and tuned these approaches. However, in the end they are relying on the same basic approaches that the rest of the machine learning and data mining community relies upon. These have been combined to support exactly one task: answering questions.

The entire purpose of Watson is to answer questions based on the (potentially vast) amount of information available. This focus has allowed IBM to concentrate their efforts; there are numerous tasks that Watson can’t, and never will be able to, do, simply because it wasn’t designed for them. Watson isn’t some fantastically brilliant AI using some revolutionary algorithm or technique; it is just a (really big) program, and an excellent example of what contemporary machine learning is capable of at scale.

Watson is, in my opinion, a feat of engineering. Given that it uses current state-of-the-art machine learning techniques, and not some IBM secret sauce, it is an exceptionally well executed machine learning tool. A tremendous amount of engineering effort has gone into bringing together different machine learning approaches and unifying them into a package that can be deployed in a breadth of application domains (health care, finance, business management, security, etc.). The complexity of a single domain, if not managed properly, is often enough to cripple most software projects; Watson takes this complexity in stride and provides value to its customers.

I think the most interesting thing about Watson isn’t Watson itself but its application to different domains. IBM has taken approaches that have existed primarily in academia and made them accessible, on a huge scale, to the world. The value of machine learning and data mining techniques increases as we increase the amount of data processed. Watson will allow us to process tremendous amounts of information, and to really understand the impact of machine learning in different application domains.

Finally, I think Watson provides us with the opportunity to understand the limitations of current machine learning techniques. What problems with these approaches will manifest at scale? At what point is machine learning no longer viable, how will we know when we reach that point, and what will we do after that?

In conclusion, I am disappointed that Watson isn’t some out-of-this-world piece of technology; I really expected to be blown away. Upon reflection, this appears to have been a naive opinion. Life lesson: “there is no silver bullet”. However, Watson is an exceptional tool that will change how we approach problems in numerous domains, but it is just that: a tool.

Deep Work

Over the past few weeks, I have read Cal Newport’s new book Deep Work.

As a tenure-track assistant professor, I found certain parallels and it was an interesting read. I think it has relevance to the work of our lab and to how and what we are trying to accomplish. I have read Cal’s previous book, quite enjoyed it, and so looked forward to this one coming out in January 2016.

Cal’s basic premise in this book is that there are two types of work: Deep Work and Shallow Work.
Deep work comes about as you develop an ability to focus; it is a skill that is increasingly important and harder to naturally develop in this connected time. As he says, deep work is the kind of work that gets you tenure (or promotions), while not doing shallow work, he somewhat jokingly says, is what gets you fired. Thus we all need to strike a balance between the two. Here are more detailed descriptions of each:

Deep Work

The kind of work that Cal describes as deep work is the complex tasks that require high skill and high focus. For him, that is theoretical computer science proofs. For me, that would be paper writing, study design, and the qualitative analysis and careful synthesis I like to do related to health information systems / health systems and user experience. This is the work where you might experience flow, and it is the type of work that crumbles with interruptions.

Shallow Work

Shallow work is the interruptions. They may be important and are often urgent.
Classic shallow work includes things like quick emails, scheduling meetings, and often the meetings themselves.

Deep Work routines

I like how Cal describes a number of deep work patterns he has seen be effective, from the “every day routine” to going away, building a stone tower, and staying there for weeks every year to get the deep thinking done. I have used both (well, not a stone tower, but I went away to write my thesis: no clinic, no classes, nothing but 1000+ words a day, repeat until done).

My takeaways from the book

I think it is worth reading, especially if you are a grad student looking to get chunks of work done on your thesis and papers.

My clinic work does not fit the deep vs. shallow distinction very well. It is a form of deep work in that it requires focus, knowledge, and years of training, but I do not get the luxury of focusing on one patient for 90+ minutes; instead, I’m seeing 30+ patients in a day. It isn’t Cal’s classic deep work, but it is not shallow work either. It is a different rhythm for me. I think that clinic work has helped me with the general technique of focusing, but not with sustaining that focus on a single topic. For that I can look at some of my other practices. I found that I was already doing several things in the book to promote deep work.

  1. I have a weekly / daily routine with deep work rhythms.
  2. I ensure a good chunk of deep work time each day (usually first thing in the morning before people are up in my house). I also block off chunks of deep work (often later in the week) and try to cluster meetings and clinical time in other parts of the week so I can do a bit of “mode shifting”.
  3. I have reduced many of the common attention sinks / alerts in my day (e.g. no buzz on my phone for every email).

I found I could do more – and that “more” for me meant focusing on my shallow work.

  1. I cluster meetings wherever I can (but I still have a lot of meetings).
  2. Email: While I have turned off automatic checking and bells and whistles when email arrives, I do check more often (manually) and I am often triggered by the little icon showing how many messages there are to be processed (rarely at zero these days).
  3. Timing: Cal does not get into timing too much, but I know my best deep work is done early, and I will work more to ensure the mornings have 1-2 chunks of deep work before I get “too shallow” with calls, emails, etc.

My actions:

  1. Email notifications: I will move my mail app off my task bar and turn off the notification badge. That seems small, but now I cannot see the 28 emails pending – even though it wasn’t pinging me actively, I would find it impossible not to look as I moved between apps.
  2. Meeting my Email: I have fit email into my schedule whenever it fits, thinking of it as small and squeezable. It is true that it is, and that seems to work for days with meetings. HOWEVER, it can distract on days where I want to get into thinking rhythms. If it IS a habit to check email while, say, boiling the kettle, then I’ll be more likely to do that when I am in deep thought and boiling the kettle. Instead of getting a cup of green tea being a quiet movement away from a deep problem, it becomes a phase shift. By booking email slots in my day, I can be more conscious of those shifts.
  3. One of the things that has been on my “deep work list” (long before this book) is to take a reading holiday. That is, take work time to get away and take with me a set of books / papers that address a big issue and not let myself be distracted for a large chunk of time. Bill Gates was known to do this and I have been meaning to try this. That will be one of my actions – maybe not for a full week, but at least several days to start.

Comments of Software Design Processes

Repost from Simon’s Blog

Last week I attended a presentation by Brendan Murphy from Microsoft Research – Cambridge. Dr. Murphy presented on his research regarding software development processes at Microsoft. This post contains a summary of the presentation and my thoughts on the material presented.

The discussion focused on the impact of particular architectural and process decisions. For example, how does the choice of architecture affect the deployment of new source code? Consider a micro-service architecture in which several independent components work in concert; it is likely that each of these components has a unique code base. Deployment of updates to one component must consider the effects on other components, i.e. a loss or change of a particular functionality may cause problems in other components. Perhaps using a more monolithic architecture reduces this concern; however, software monoliths are known to have their own problems.

The way that we, as software engineers and developers, manage variability in source code is through branching. Code repositories often consist of a number of related branches that stem from a “master” or main version of the code; branches often contain a specific feature or refactoring effort that may eventually be merged back into the main branch. This concept is familiar to anyone who uses contemporary version control methods/tools such as Git.

In his presentation, Dr. Murphy discussed a number of different branching models, some of which have been experimented with at Microsoft at scale. Different approaches have pros and cons and will support different architectures and deployments in different ways. Of course, when considering the effect of different models it is important to be able to measure properties of the model. Murphy presented several properties: productivity (how often files are changed over time), velocity (how quickly changes/features move through the entire process), quality (the number of defects detected and corrected in the process), and coupling of artifacts (which artifacts often change together). These can be used to evaluate and compare different branching models in practice.

The most interesting part of the presentation, from my point of view, was the discussion of current work based on branching models and these properties. In theory, a control loop can be used to model an evolving code repository. At every point in time the properties are measured and then fed into a decision-making unit which attempts to optimize the branching structure based on the aforementioned properties. The goal is to optimize the repository branching structure. Murphy indicated that they are currently testing the concept at Microsoft Research using repositories for various Microsoft products.

The main takeaways from the presentation were:

  • Deployment processes must be planned for as much, if not more, than the software itself.
  • Architecture will impact how you can manage software deployment.
  • Repository branching models have an effect on version management and deployment and need to be designed with care and evaluated based on reasonable metrics.
  • A control loop can theoretically be used to evolve a repository such that there is always an optimal branching model.

From my personal experience working as both a contractor and an employee, the point about planning for deployment is entirely correct. If software isn’t deployed, or is inaccessible to its users, it is as if it doesn’t exist. Deployment must be considered at every step of development! The presentation did cause me to pause and consider the implications of architecture affecting the patterns of deployment and required maintenance. As we continue to move towards architectures that favour higher levels of abstraction (in my opinion, one of the primary concepts every software engineer should embrace), we will need to find ways to manage increasing variability between the abstracted components.

The notion of a control loop to manage a branching model is quite interesting. It seems that we could, in theory, use such a method to optimize repository branching models, but in practice the effects might be problematic. If an optimum can be found quickly, early on in a project, then it seems like a good idea. However, if weeks and months go by and the branching model is still evolving, this might cause problems wherein developers spend more time trying to adapt to a continually changing model rather than leveraging the advantages of the supposed “optimum” model. On the other hand, a continually changing model may also force development teams to be more aware of their work, which has benefits of its own. Really, this idea needs to be tested at scale in multiple environments before we can say more; an interesting concept nonetheless.

References and Resources