Advice to students – Part 2 of 2

Okay – we’re back (please read Part 1 if you haven’t already). Let’s finish off with what those last two letters – A and D – represent when it comes to doing a Master’s degree or otherwise being part of the LEAD Lab team.

A is for Advising

Jens, Morgan and the whole LEAD Lab team are here for you when it comes to advice on everything from thesis topics to course selection to job opportunities. Team meetings often feature DesignFests – short presentations by any LEAD Lab member on a variety of topics. If you have pressing questions about your thesis work to date, or about some other issue around your Master’s process, consider whether you can turn them into a DesignFest and present it to the group for their input – or meet with Morgan or Jens one-on-one.

D is for Directing

This meaning of directing is not so much directing as in directions on a map – although there is some of that – but more like directing as in the movies. I like this definition from Wikipedia, and I think if you just substitute ‘successful Master’s degree’ for ‘film’ and ‘you’ for ‘cast and crew’ you’ll get what I’m talking about.

“A film director is a person who directs the making of a film. A film director controls a film’s artistic and dramatic aspects and visualizes the screenplay (or script) while guiding the technical crew and actors in the fulfillment of that vision.

The film director gives direction to the cast and crew and creates an overall vision through which a film eventually becomes realized, or noticed. Directors need to be able to mediate differences in creative visions and stay within the boundaries of the film’s budget.”

In this analogy the ‘budget’ is the time and energy you have available to dedicate to your Master’s program. Jens and Morgan will help you stay within this ‘budget’ while also negotiating “differences in creative visions” and making sure that you accomplish (realize) the end result while staying true to your personal “artistic and dramatic aspects”. Maybe the analogy is a bit of a stretch, but I think you can see what I am getting at.

In closing I’ll just say that Jens and Morgan, as principals of the LEAD Lab, want to ‘direct’ Master’s students to help them succeed with as little stress and angst, and as much learning and self-discovery, as possible. And in fact, as I hope I’ve made clear with this post, all LEAD Lab members are rooting for each other – it’s what makes this small (but mighty) team such a great group to work and study with. Whether you’re new to LEAD Lab or returning after prior studies or employment with the Lab, welcome – we’re glad to have you.


Advice to students – Part 1 of 2

If you look at the LEAD Lab website you’ll see that LEAD is an acronym representing the words Learn, Engineer, Act and Design. But after nearly four months working here on a co-op work term I’ve come to realize that these letters could also stand for the following attributes, which I’d like to share with you in the form of ‘advice to students’. (For a TL;DR version see the graphic below.)

This advice is mostly directed at those joining the lab to pursue a Master’s degree (with Dr. Jens Weber and/or Dr. Morgan Price as your supervisors) but could equally apply to those going for a PhD or joining in some other capacity – such as a work study student or someone pursuing an interdisciplinary studies course through Health Information Science.

L is for Listening

The people at LEAD Lab really believe in – and value – listening. They want to hear what you have to say and want you to feel free to say it no matter how small the matter seems or where in the lab’s hierarchy you are (or where you perceive yourself to be). On that note, Dr. Weber and Dr. Price both acknowledge that there is a hierarchy though not as rigid, as deep or as articulated as in other environments you might have encountered or worked in. For example, they are both fine with being Jens and Morgan and with you challenging them – respectfully and civilly of course – in discussion when you think there is a perspective they have overlooked or that you disagree with.

And if you find, despite the informal atmosphere of the lab, that you do not want to approach Jens or Morgan directly with a concern or issue, I have volunteered – at least until September 2018 – to act as a kind of ombudsperson and take things forward anonymously to either or both of them. You can reach me via email at and arrange to talk to me face-to-face at your convenience (if you wish), with confidence that I will keep your identity secret.

E is for Examining, Exploring, Encouraging and Enhancing

I know – that’s a lot of Es! Let’s take them one at a time.

Actually, the first two Es – examining and exploring – are an attempt to capture what Iryna Davies – a previous Master’s student supervised by Jens and now a PhD student in the University of British Columbia’s Experimental Medicine program, supervised by Morgan – mentioned at a recent LEAD Lab team meeting: when students transition from undergrad into a Master’s program, especially if they are on the younger side and may not have extensive work or life experience, they can feel a bit lost. They go from an academic environment with firm deadlines, structured coursework and final exams looming at the end of the road to one littered with shiny, interesting topics and a confusing but enticing network of paths that one could follow on the way to a completed thesis.

Jens and Morgan want to help students navigate this new environment – through one-on-one discussion, email, or group back-and-forth at team meetings (if appropriate) – both to examine the intersection of their individual interests with what could make a good thesis question/topic, and to explore widely but within reason.

The within reason part segues into the next E – encouraging. LEAD Lab wants to be a place that encourages students to find that balance between examining and exploring that allows them to stay on track with progress towards their eventual degree completion while still letting them learn about new avenues of interest and topics that they might want to return to post degree.

The last E – enhancing – was brought to mind by the comments of Adeshina at our last LEAD Lab team meeting. Adeshina, who hails from Nigeria and has a B.Sc. in Computer Science from the University of Ibadan, is doing his Master’s at UVic while also working as both a Research Assistant for a LEAD Lab project with a local healthcare IT company and a Graduate Teaching Assistant for the department of Computer Science. He talked about his strategy of spreading out the necessary coursework for a Master’s over the full time of the program and enhancing his learning and thesis development by choosing courses that supported and strengthened his knowledge in topics relevant to his thesis. Adeshina also stressed the importance of having a carefully laid out plan for the entire time of your graduate studies with well-defined milestones that have been agreed upon by both the student and supervisor(s) – this helps to keep students on track and accountable for their progress and makes it easier to see early on if a student is getting into difficulty.

This is getting pretty long so I’m going to suggest that we cover the letters A and D in Part 2 of this post.

Clinical UI Design Sprint – Self Challenge

Starting Some Design Sprints

I have decided to start putting more “design sprints” into my schedule as a way of flexing my UI/UX design muscles in prep for future work.
I came from a UI design background, so I am getting back to some roots through this activity.

Like any good project, I have defined some constraints for a sprint. Here are my constraints:

  1. I have one week for a sprint with a topic of interest to me (does not have to relate to a project).
  2. I must complete 1 pomodoro sprint daily (if you do not know those, a pomodoro is essentially a 25-minute block of focused effort followed by a rest).
  3. I must complete 10 pomodoros through a work week (Monday to Friday)
  4. I will document my work.
  5. I ship the design on the weekend.

I am focusing on healthcare IT design considerations, such as portions of the EMR and personal health services.

I will use methods that I can do myself, so I do not have to coordinate ethics, users, etc. This is meant to encourage doing the work with as few barriers as possible – and, let’s be honest, waiting on recruitment for a focus group is not a pomodoro.

Methods I am using include:

  1. Simple web research
  2. Mood boards
  3. Personas
  4. Activity modeling
  5. Use Case Mapping
  6. Wireframing / Prototyping

I have picked these because they will fit into pomodoros and they have a visual component so they might be interesting to share online.

Anyone is welcome to join me on a sprint.

My first attempt was a sprint on the problem list in EMRs and I documented the work in a little video.

It was interesting and good to do. I will be making more space to do these in the coming months as I find them recharging.

On Symbolic Execution and more

Wait, what?

It is exactly what it sounds like.

Symbolic execution means assuming abstract symbolic values for variables in order to determine the flow behaviour of a program.

This concept was put forth in a paper titled “Symbolic Execution and Program Testing” by James C. King. Rather than using the data objects provided by the programming language for storing values, we use arbitrary symbolic values for computation. You can see that normal computation is a special case of symbolic execution in which we “concretize” the values. Let us take a look at some assumptions made by the author.


  1. Computer programs deal only with integers.
    This assumption is fairly valid and almost harmless in many analyses. Although a lot of programs work with float and double values, we will assume integral values only in order to reduce the search space. We will also ignore integer overflows (in the case of 32/64-bit integers in C, for example).

  2. The execution tree resulting from many programs is infinite.

  3. Symbolic execution of IF statements requires theorem proving which is mechanically possible.
    This is a very critical assumption and forms the basis of all analysis performed on the execution tree.

Why symbolic execution?

It is the “workhorse” of modern software testing. As with the subject of my previous blog post, symbolic execution saw a rise in interest and research due to the increase in CPU speed and the ability to execute things in parallel. We will look at how exactly this is useful. It is also one of the very few ideas that has made it into industry and practice, breaking out of the academic bubble.

How does it compare to the current techniques of static and dynamic analyses?

Symbolic execution provides some advantages over both these approaches. Let us look at them:

In comparison to (dynamic) testing, it allows us to explore a much broader search space. We can make sure that the program is tested on almost all possible inputs before it is shipped.

Static analysis has seen a lot of growth in the past few years. We now have better approaches for applying techniques to binaries as well as to source code. But one of the fundamental issues with such algorithms is that they can point out the possibility of an error, yet they can never output the input that triggers it. Symbolic execution scores a +1 in that regard.

I think the above two reasons are enough motivation to peek under the hood.

Control Flow graph and understanding the semantics of the flow

As wikipedia defines it,

A CFG is a representation, using graph notation, of all the paths that might be traversed through a program during its execution.

A symbolic execution engine works on this exact representation of the program. Instead of variables having numeric (concrete) values, we choose one of the infinitely many abstract values and give it to the variable. Let us check out an example:

#include <stdio.h>

int main() {
    int i;
    scanf("%d", &i);
    if (i == 5) {
        /* true branch */
    } else {
        /* false branch */
    }
    return 0;
}
Now we can draw a CFG of the above program as follows:

From the graph above we can see that the IF statement makes the graph branch in two. Depending on the conditions imposed on the program’s variables in the IF statement, the program will behave differently. In this particular case, if the value of the variable i is 5, control flows to the true branch, and to the false branch otherwise. The rules for evaluating these statements must be extended to include symbolic values as well.

Understanding the generation of the path equation

A program state is usually defined by the values of its variables (global and local, of course) during normal execution. But King, in his paper, describes the generation of a path equation which has to be included in the program state. A path expression, or path condition (pc), is always initialized as true. As the program follows a particular execution path, more and more conditions are conjoined to the pc. Each statement thus contributes to the path expression (unless it is a trivial assignment).

An interesting situation arises for the IF condition. A trivial case would be if the condition evaluates to one of the two boolean values; if it does, we continue with that branch and do not consider the other one. But if we cannot determine at symbolic-execution time what the condition will evaluate to, then we need to take a look at both branches. This happens due to the involvement of a symbolic variable in the condition.

We now fork the execution into two branches: one where the true branch was taken, and another where it was not. This leads us to generate two expressions, one of the form pc AND (i == 5) and another pc AND (i != 5). Symbolic execution of the rest of the program then continues with both these states in memory.
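To make the forking concrete, here is a minimal, hypothetical sketch (mine, not from King's paper) of how an engine might fork the path condition at the IF statement of the example above; the symbolic input is called alpha and conditions are kept as plain strings:

```python
# Minimal sketch of path-condition forking (illustration only).
# The path condition (pc) is a list of conjoined conditions.

def execute_if(pc, condition):
    """Fork the current pc at an IF statement.

    Since the condition involves a symbolic variable, both branches
    are feasible, so we return one new state per branch.
    """
    true_state = (pc + [condition], "true-branch")
    false_state = (pc + [f"not({condition})"], "false-branch")
    return [true_state, false_state]

# The pc is initialized to true; the IF forks it into two states.
states = execute_if(["true"], "alpha == 5")
for pc, branch in states:
    print(" AND ".join(pc), "->", branch)
```

Running this prints the two path conditions true AND alpha == 5 and true AND not(alpha == 5), matching the two expressions described above.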

Let us see the generation of the path expression of the above C program:

  1. Statements 1, 2: A fresh symbolic variable is assigned to i. Let us call it alpha. Current pc: true.
  2. Statement 3: An IF condition is encountered. Since it cannot be evaluated, we generate two conditions. pc: true AND (alpha == 5) | true AND (alpha != 5).
  3. Statement 4: Since this is the last statement in the program, we have already evaluated all the branches, so the expression here is the logical OR of both the expressions in memory.

Thus, after evaluating the entire (execution) tree in this manner, we end up with certain conditions on the variables at each leaf node. This is a completely exhaustive search, since we have assumed the values to be totally abstract and have provided no information on what they could be (apart from being integers).

Below is the tree representation:

This is a very simple example, but it gets the idea through. Although things get a lot messier with assignments and goto statements, the concept remains the same. The material above is enough to understand the basic literature on the topic.

Theorem provers and SMT solvers

Well, now what? We have the path expression of the entire program. Assuming we want to generate test cases for the ERROR branch, we need to solve the boolean expression. Enter SMT solvers! SMT (Satisfiability Modulo Theories) solvers try to solve decision problems. Since we are applying constraints on each variable throughout the program, they are also known as constraint solvers. Microsoft’s z3 is one of the most widely used (in production) theorem provers freely available online.

They take in an expression containing symbolic variables and the constraints imposed upon them, and try to give us a result depending on the satisfiability of the problem. Sometimes we get a result in seconds; other times the solver tells us the problem is unsatisfiable; and sometimes it fails to deduce either fact, since satisfiability is an NP-complete problem. The algorithms these solvers use internally are a matter for a completely different blog post, but they basically try to combine the expressions in the given path condition and see if a valid solution exists. Since the search space is very large (unless otherwise bound by a constraint), this can take a lot of time, and performance degrades as the number of expressions grows.
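To see what “solving” means here, consider a toy version: given a path condition, a solver must either produce a concrete value satisfying it (a model) or report that none exists. The sketch below is my illustration only – real solvers like z3 reason symbolically rather than by enumerating a bounded integer domain:

```python
# Toy "solver": look for a model of a constraint in a bounded domain.
# Real SMT solvers reason symbolically instead of enumerating.

def solve(constraint, domain):
    """Return a satisfying value for `constraint`, or None if no model exists in `domain`."""
    for value in domain:
        if constraint(value):
            return value
    return None

# pc: true AND (alpha == 5)  ->  satisfiable, with model alpha = 5
print(solve(lambda alpha: alpha == 5, range(-100, 100)))
# pc: (alpha == 5) AND (alpha != 5)  ->  no model exists
print(solve(lambda alpha: alpha == 5 and alpha != 5, range(-100, 100)))
```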

Are we going to have a demo or what?

In Clojure? Sure! We are going to try to solve Project Euler problem 9. The problem statement is as follows:

A Pythagorean triplet is a set of three natural numbers, a < b < c, for which,
a^2 + b^2 = c^2
For example, 3^2 + 4^2 = 9 + 16 = 25 = 5^2.
There exists exactly one Pythagorean triplet for which a + b + c = 1000. Find the product abc.

It is pretty easy to just iterate over all the possible values up to 1000. But we will try to do something new! \o/
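For comparison, the brute-force iteration mentioned above takes only a few lines (Python used here for illustration):

```python
# Brute-force search for the Pythagorean triplet with a + b + c = total.
# Requiring a < b keeps each triplet unique; c is then fixed by the sum.

def pythagorean_product(total):
    for a in range(1, total):
        for b in range(a + 1, total - a):
            c = total - a - b
            if a * a + b * b == c * c:
                return a * b * c
    return None

print(pythagorean_product(1000))  # 31875000, from the triplet (200, 375, 425)
```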

Okay, now let’s say that a, b and c are symbolic variables. We will use the clojure.core.logic library for this.

(ns clj.p9
  ;; `==` conflicts with the clojure.core operator, so exclude it.
  (:refer-clojure :exclude [==])
  (:require [clojure.core.logic :refer :all]
            [clojure.core.logic.fd :as fd]))

(defn solution
  ;; We take as input `sum`, the sum of the three Pythagorean variables.
  ;; The result comes back in a vec as [a b c].
  ;; Since we want the product a*b*c, we just multiply them together.
  ;; We use `first` since the two results provided are
  ;; [a b c] and [b a c], with c being the hypotenuse. Hence we choose one of them.
  [sum]
  (reduce * (first
    ;; Here we go! `run*` means we want the computer to compute
    ;; all the possible values for the expression `q`.
    (run* [q]
      ;; Declare 3 fresh symbolic variables for a, b and c.
      (fresh [a b c]
        ;; The solution should be a vec of the values of a, b and c.
        (== q [a b c])
        ;; All three numbers have to be distinct,
        ;; so we use fd/distinct to add that constraint.
        (fd/distinct [a b c])
        ;; Another constraint we can derive: since the sum of two sides is always > the third,
        ;; each side has to be less than half of the `sum` value.
        ;; This reduces our search space by quite a lot. fd/in lets us specify the range;
        ;; (fd/interval start end) is the actual numeric range.
        (fd/in a b c (fd/interval 1 (quot sum 2)))
        ;; We now add the given constraints on the variables; fd/eq translates
        ;; ordinary arithmetic into finite-domain constraints.
        ;; sum = a + b + c
        ;; c^2 = a^2 + b^2 (since they are Pythagorean triplets)
        (fd/eq
          (= sum (+ a b c))
          (= (* c c) (+ (* a a) (* b b)))))))))

EZ PZ. Let’s run this!

(time (solution 1000))
"Elapsed time: 2345.989466 msecs"

There we have it – solved in 2.3 seconds! Although this was fairly simple, we learnt a lot on the way. core.logic is not as highly optimized as z3, but it is fairly fast; it is built for relational (logic) programming, of which this kind of symbolic constraint solving is a special case. z3 also has a nice Python API to interact with it!

Issues involved

Current research trends include the analysis of both source code and raw binaries. Program testing can be approached in an entirely new way if the inputs to methods are considered symbolic and we then fuzz over the entire search space. It is, however, very important to realize that a lot of problems can come up in practical implementations:

  1. Path explosion: If there are a lot of branching statements in the program, a new path expression is generated at each branch, creating a new instance of the symbolic runtime and increasing the number of paths. This is a very serious problem and requires smart input from the user to prune certain branches off.

  2. Environment interactions and system calls: A symbolic engine finds it very tough to interact with the underlying OS or the runtime environment, which can hinder a lot of core systems testing. A program may take decisions based on the return value of a particular read, or write to a particular file and then branch on the result. There is ongoing research into implementing a virtual model of the underlying stack for the engine to interact with.

A thought experiment: Symbolic Execution in Healthcare?

Medical systems need to be completely fool-proof and verified before they are sent to production. Symbolic execution and analysis can be absolutely critical in such verification. As with normal programs, even though certain inputs are completely pointless (for example, providing a negative number to a square-root calculator), it is important that we implement correct error handling.

Although the above is practical, it would be more interesting to apply this logic to the human body directly, which is… let’s just say it’s a thought experiment. Some of my thoughts on the ideas below stem from the discussion here. Below are some assumptions I make:

  1. Since the human body is made up of a finite number of atoms, at any point in time there is a set of values which describes the entire body state (assuming we manage to deconstruct the brain).

  2. Although huge, we have a model of the human body which can be provided to a computer to generate an approximate system emulating the body’s internals and its reactions to stimuli. Although far-fetched, we are slowly inching toward such a goal.

  3. The computer we imagine is fast enough to perform calculations on such a scale.

Imagine medicine in such a scenario. Assume a medicinal drug is a symbolic variable and the program is the human body itself. The medicine can be thought of as an input to the body, and the system should be able to track the breakdown of that medicine in the body. The IF conditions in this case would represent existing patient conditions and body state, which would lead to different reactions in the body. Once we have all the leaf nodes of the graph of such a system, the resulting information could be highly informative for providing crucial healthcare and designing drugs, since we would have data on each breakdown of the chemicals. Drug design could benefit greatly from such a scenario!

The possibilities are endless. For now, let me get back to ValidIS.


[1] Symbolic Execution and Program Testing, JC King.
[2] Lecture notes on Symbolic Execution, University of Maryland.

Solving the trivial problem of dimensional analysis in Clojure

Functional Programming

Functional programming (or fp, as it is often called) is a paradigm of programming that aims to describe what we want to do rather than how we want to do it. The entire universe of fp revolves around the concept of pure functions.

A pure function is a function where the return value is only determined by its input values, without observable side effects.

Imagine, if you will, a conveyor belt which takes in raw material as input. A lot of tools process the input as it goes along, eventually producing the desired result. You can consider functional programming to be similar to this situation in a number of ways. The conveyor belt can be thought of as the program, with each of the tools being a pure function (no external side effects, dependent only on its input). The only difference is that on the belt the original resource is modified in place, whereas in the program the data is worked upon and a new instance is passed on to the next function.

Although fp has been around for a while, due to the advent of multi-core systems, big data and distributed computing it is seeing a resurgence in terms of interest in the programming community. It is backed by very strong literature and there is a lot of ongoing research on the same.

Some of the principles that fp operates by are as follows:

  1. Immutable data structures: Data, once created, is never modified in place. So how do we program with such an apparent impairment? Rather than mutating data, we operate on it and create a new instance; thanks to structural sharing this is cheaper than it sounds, and it allows the compiler to make certain assumptions that would be unsafe in an imperative language.

  2. Higher-order functions: functions which take other functions as arguments or return them as results – an idea derived from the lambda calculus. One of the best examples of a HOF is map, which takes a function as its first argument and applies it to each element of the list provided as its second argument. Since each function application is independent of the others (assuming pure functions), there is a lot of scope for speeding this up on a multi-core system. The results can then be aggregated using a function such as reduce.

  3. Separation of functions and values: To an OO person, the concept of encapsulation is pivotal: a class has an internal state which is modified by calling class methods that operate on those values. In fp, these two entities are clearly separated.

  4. Statelessness: Since all functions are pure, a function has no idea of what has happened, or what is going to happen, to the input it receives. It only cares about operating on the data and producing the output. Combined with immutability, this is extremely useful for unit testing, since each function has a specified set of duties and, in performing them, will not affect any external state or variables. The state of the program stays local to where it is needed.
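The map/reduce pattern from point 2 can be seen in a couple of lines (Python here for brevity):

```python
from functools import reduce

# `map` applies a pure function independently to each element...
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# ...and `reduce` aggregates the independent results into one value.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares, total)  # [1, 4, 9, 16] 30
```

Because each application of the mapped function is independent, the map step could be farmed out across cores and the partial results combined afterwards with the reduce step.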


Clojure is a dynamic, general purpose programming language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming.

Rich Hickey is a genius. Clojure sits on top of the JVM, leverages Java’s type system and supports calling Java code from within. Clojure has a set of internal data structures such as maps, sets, vectors and lists, and supports lazy sequences, immutability and persistent data structures. It also has a typical Lisp syntax (which can be a little intimidating for the novice programmer).

Why am I doing this?

I was inspired by the solution to the problem as given by Peter Henderson in his book on Functional Programming and its applications. It is described in a very general manner and I thought it would be interesting to implement it in Clojure using a few of my own variations.

Why Clojure?

It’s awesome. It’s functional and thinking functionally gives you a way to come up with interesting solutions to common problems encountered during programming.

Cutting to the point

Dimensional Analysis is a very simple method to check if a physical equation is valid or not.

The dimensions of a quantity refer to the type of units that must be used in order to obtain the measure of that quantity. These are built from the 7 fundamental units, which cannot be described in terms of anything else. We will mainly be looking at equations involving quantities made up of Mass (M), Length (L) and Time (T).

We will assume that the equation is linear in all the physical variables and that the operators used are multiplication, division, addition and subtraction. Although this makes things a lot simpler than they really are, it will help us set up a basic framework for solving such a problem.
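The bookkeeping behind dimensional analysis is simple: represent a dimension as a map of exponents over the base quantities, add the exponents when multiplying quantities and subtract them when dividing. A quick sketch of that idea (Python used for illustration only; the actual implementation below is in Clojure):

```python
# A dimension is a map of exponents over the base quantities M, L and T.
velocity = {"M": 0, "L": 1, "T": -1}   # L * T^-1
duration = {"M": 0, "L": 0, "T": 1}    # plain time

def product(d1, d2):
    """Multiplying two quantities adds their dimension exponents."""
    return {k: d1[k] + d2[k] for k in d1}

# velocity * time comes out as a plain length, as expected.
print(product(velocity, duration))  # {'M': 0, 'L': 1, 'T': 0}
```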

Understanding the problem

We will be trying to solve the following problem:

Assume that we have a set of variable quantities v1, v2, etc., each with a unique single-letter symbol and a given set of dimensions a1, a2, etc.

We will be trying to determine two things:
1. If the equation is dimensionally correct.
2. The dimension of the resulting quantity.

Let us assume a very straightforward model of an equation.

  1. We will consider that the equation is made up of Expression(s).
  2. Each Expression has a Left Hand Side(LHS) and a Right Hand Side(RHS) with an Operator acting upon them.
  3. The most fundamental quantity in the equation will be an Atom. An Atom can be a constant or a variable.
  4. Each quantity has a Dimension where we mention the exponents to the base quantities.

We will be using schema for data descriptions. schema provides functions to validate data and find out whether it conforms to the given schema descriptions.

;; schema is required as `s`, i.e. (:require [schema.core :as s]).

;; Dimension will contain the exponents of M (Mass), L (Length) and T (Time).
;; With s/Int, we specify that the values have to be integral.
(def Dimension
  {:M s/Int
   :L s/Int
   :T s/Int})

;; An Atom will contain a type specifying if it is a variable or a constant.
;; Let us also have the symbol of the variable, in the data definition.
;; Most importantly, the dimension of the quantity has to be mentioned as well.
;; As you will observe, the dimension is described in terms of the earlier schema.
;; This will help in Atom validation.
(def Atom
  {:Type      (s/enum :variable :constant)
   :Symbol    s/Str
   :Dimension Dimension})

;; Let us also define the dimension for a constant quantity.
;; Since it is a constant it will not have exponents in any of the quantities.
(def Constant
  {:M 0
   :L 0
   :T 0})

;; Since we are assuming a simple case, let us say that the operations can be addition,
;; subtraction, division and multiplication.
;; The model is so designed that it can be easily extended to other operations.
(def Operator
  (s/enum :add :subtract :divide :multiply))

;; We will now describe the definition for Expression.
;; An expression can be described in a recursive manner.
;; The LHS and the RHS both can be Expressions or Atoms. schema provides a nice way to do this.
(def Expression
  {:LHS (s/conditional #(= nil (:Type %)) (s/recursive #'Expression) :else Atom)
   :RHS (s/conditional #(= nil (:Type %)) (s/recursive #'Expression) :else Atom)
   :Operator Operator})

;; We first check if the data in the LHS quantity has a Type field.
;; If it does, we know that it is an Atom, and we direct the checker to validate it for us.
;; If it does not, we check for an Expression instead.
;; This can definitely be optimized but I thought this was one way to do it.

Coming up with functions for dimension calculations

Now, that we have the basic definitions sorted out, we can look forward to implementing some functions for checking dimensional correctness.

Assume a function which takes in an Expression or an Atom. We need to find out if the given expression is dimensionally correct. Let us divide the possible scenarios into cases such as:

  1. Input is an Atom. We can then return the dimension of the quantity.
  2. Input is a constant. We can then return Constant which is basically a dimensionless quantity.
  3. Input is an Expression with Operator :add. In this case, we need to check if the LHS and the RHS have the same dimensions. If they do, we can return the dimension of either. If they do not, we return an undefined value (say UNDEF).
  4. Similar is the case for subtraction as well.
  5. Input is an Expression with Operator :multiply. We don’t need to check anything for this; we return the sum (exponent-wise) of the dimensions of the two quantities involved.
  6. Input is an Expression with Operator :divide. Again, we return the difference between the dimensions of the numerator(LHS) and the denominator(RHS).
;; A generic undefined value. We can use this to convey that the equation is dimensionally incorrect.
;; Any sentinel value distinct from a real dimension map will do.
(def UNDEF :undef)

;; Used to calculate the dimension of the resulting quantity after the product of the two inputs.
;; If either of the quantities is UNDEF, this function returns UNDEF.
(defn product
  [dim1 dim2]
  (cond
    (= UNDEF dim1) UNDEF
    (= UNDEF dim2) UNDEF
    :else {:M (+ (:M dim1) (:M dim2))
           :L (+ (:L dim1) (:L dim2))
           :T (+ (:T dim1) (:T dim2))}))

;; Used to calculate the dimensions of the resulting quantity after the division of the two inputs.
;; If either of the quantities is UNDEF, this function returns UNDEF.
(defn ratio
  [dim1 dim2]
  (cond
    (= UNDEF dim1) UNDEF
    (= UNDEF dim2) UNDEF
    :else {:M (- (:M dim1) (:M dim2))
           :L (- (:L dim1) (:L dim2))
           :T (- (:T dim1) (:T dim2))}))

;; Main logic function. Takes in an expression as an argument.
;; This function gives as output the dimension of the resulting quantity,
;; based on the Expression map provided to it.
;; If the equation is dimensionally incorrect, it will return UNDEF.
(defn dim
  [expr]
  (let [lhs       (:LHS expr)
        rhs       (:RHS expr)
        operator  (:Operator expr)
        expr-type (:Type expr)]
    (cond
      (= expr-type :variable)  (:Dimension expr)
      (= expr-type :constant)  Constant
      (= operator :add)        (if (= (dim lhs) (dim rhs))
                                 (dim lhs)
                                 UNDEF)
      (= operator :subtract)   (if (= (dim lhs) (dim rhs))
                                 (dim lhs)
                                 UNDEF)
      (= operator :multiply)   (product (dim lhs) (dim rhs))
      (= operator :divide)     (ratio   (dim lhs) (dim rhs)))))

Setting up a grammar for parsing the equation

Now that we have the general setup ready, we can go ahead and worry about actually taking user input and parsing it into the given Expression format. For this, we will be using the instaparse library. It allows you to specify your own grammar and builds a parse tree from it.

Let us consider the simplest linear equation we will be looking to solve:

v = u + at

One of the standard kinematic equations of motion. Very simple to understand, but parsing it might be a pain. To make things slightly tougher, let’s say that the quantities in the equation can have integral coefficients as well. So it becomes:

v = Xu + Yat where X,Y are integers

We can basically take a standard arithmetic grammar and make a few changes to it to include physics terms, or, as I will refer to them, pterms.

;; Introducing pterm as a new grammar rule.
;; pterm is basically a coefficient and multiple symbols multiplied together.
;; The output format specified here is :enlive, which will help us in formatting the parsed data later.
;; (The pterm, coefficient and sym rules below are reconstructed from how they are
;; used later; the original listing was truncated.)
(def physics
  (insta/parser
    "expr        = add-sub
     <add-sub>   = mul-div | add | sub
     add         = add-sub <'+'> mul-div
     sub         = add-sub <'-'> mul-div
     <mul-div>   = pterm | term | mul | div
     mul         = mul-div <'*'> term
     div         = mul-div <'/'> term
     pterm       = coefficient sym
     coefficient = number
     sym         = #'[a-z]'+
     term        = number | <'('> add-sub <')'>
     number      = #'[0-9]+'"
    :output-format :enlive))

Parsing the tree!

Alright! Let us dive directly into the code:

;; instaparse provides us with an awesome function known as transform.
;; transform allows us to traverse the tree bottom-up and
;; act upon each node's data, replacing the node with the result.
;; Since :expr will be the root node, it will eventually contain the entire expression
;; map, hence we provide it the identity function.
(def parse-tree->schema-def
  {:sym         replace-sym
   :coefficient replace-coefficient
   :add         replace-add
   :sub         replace-sub
   :mul         replace-mul
   :div         replace-div
   :term        replace-term
   :pterm       replace-pterm
   :expr        identity})

;; Each of the replace-"operator" functions can be defined as follows
;; (considering replace-add for example):
(defn replace-add
  [term1 term2]
  {:Operator :add
   :LHS term1
   :RHS term2})

;; This allows us to simplify the function logic to a great extent.
;; The parser will give us the :add node with two children,
;; each representing a side of the operation.
;; As mentioned before, this is true for all functions.

Now, we have the basic setup for conversion ready, but again, we are missing a few important functions:

  1. replace-coefficient: Since a coefficient is going to be a constant, we can just create an Atom map for the constant and return it. The Symbol for it can be the value of the coefficient.

  2. replace-term: As we can see from the parsing rules above, term will always contain an integral value, and hence we can just return the Constant defined earlier.
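The post leaves these two helpers undefined. As a rough sketch (my reconstruction, not code from the original source, and assuming the parser hands each number over as a plain string), they might look like this:

```clojure
;; Sketch only: turn a coefficient into a dimensionless constant Atom
;; whose Symbol is the coefficient's numeric value.
(defn replace-coefficient
  [coeff]
  {:Type :constant
   :Symbol (Integer/parseInt coeff)
   :Dimension {:M 0 :L 0 :T 0}})

;; Sketch only: a bare integral term also becomes a dimensionless constant Atom.
;; (This handles only the bare-number case, not parenthesized sub-expressions.)
(defn replace-term
  [term]
  {:Type :constant
   :Symbol (Integer/parseInt term)
   :Dimension {:M 0 :L 0 :T 0}})
```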

Now, the two most important functions left are replace-pterm and replace-sym. Let us see how we can tackle them:

;; A pterm is basically a multiplication of the coefficient term and the symbol terms.
;; Assuming we have both of them defined correctly:
(defn replace-pterm
  [coeff-term symbol-term]
  {:Operator :multiply
   :LHS coeff-term
   :RHS symbol-term})

;; Now, this is a bit complicated:
;; we can have multiple symbols being multiplied with each other.
;; Our expression model is pretty simple in that we have an LHS and an RHS, both of which
;; can either be an Atom or an Expression.
;; Suppose we have the term "w*h*a*t"; we can consider it as
;; w * (h * (a * t))
;; This exact algorithm is implemented below.
;; It takes the first element of the symbol list and recurses on the rest.
;; If we have a single element, we call find-sym (which we will describe later);
;; if we have 2 elements, then we know that we have reached the end of the list and
;; can create a multiplication Expression.
(defn replace-sym
  [& sym]
  (case (count sym)
    1 (find-sym (first sym))
    2 {:Operator :multiply
       :LHS      (find-sym (first sym))
       :RHS      (find-sym (last sym))}
    ;; default case
    {:Operator :multiply
     :LHS      (find-sym (first sym))
     :RHS      (apply replace-sym (rest sym))}))

All well and good. The only remaining piece of the puzzle is find-sym. This is basically a lookup into an Assoc-list: Henderson’s term for a map/list in which we can look up each symbol’s dimensions. Easy.

;; Defines the variable 'v' as an Atom.
(def Assoc-List
  {:v {:Type :variable
       :Symbol "v"
       :Dimension {:M 0
                   :L 1
                   :T -1}}})

;; Simple lookup into the Assoc-List map.
;; Returns UNDEF if not found.
;; We can have the user populate this map as an earlier exercise.
(defn find-sym
  [sym]
  (let [sym-key   (keyword sym)
        sym-value (sym-key Assoc-List)]
    (if (not-empty sym-value)
      sym-value
      UNDEF)))

Lo and behold! We now have a working program which can dimensionally analyze equations for us. We can then define a function to aggregate all of the above into a single API exposing the required functionality. Maybe for a later time ;). Let us see it running:

user=> (def equation (physics "2u+3at"))
user=> (insta/transform parse-tree->schema-def equation)
{:Operator :add, :LHS {:Operator :multiply, :LHS {:Type :constant, :Symbol 2, :Dimension {:M 0, :L 0, :T 0}}, :RHS {:Type :variable, :Symbol "u", :Dimension {:M 0, :L 1, :T -1}}}, :RHS {:Operator :multiply, :LHS {:Type :constant, :Symbol 3, :Dimension {:M 0, :L 0, :T 0}}, :RHS {:Operator :multiply, :LHS {:Type :variable, :Symbol "a", :Dimension {:M 0, :L 1, :T -2}}, :RHS {:Type :variable, :Symbol "t", :Dimension {:M 0, :L 0, :T 1}}}}}
user=> (def parsed-result (insta/transform parse-tree->schema-def equation))
user=> (s/validate Expression parsed-result)
;; validation succeeds and returns the validated map
user=> (dim parsed-result)
{:M 0, :L 1, :T -1}
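One nice sanity check (illustrative, not an actual captured session) is that a dimensionally incorrect input comes back as UNDEF. Assuming u (a velocity) and t (a time) are both in the Assoc-List:

```clojure
user=> (dim (insta/transform parse-tree->schema-def (physics "2u+3t")))
;; => UNDEF, because the :add branch sees a velocity on one side and a time on the other
```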

P.S: I am a novice Clojure programmer. Do suggest improvements to my coding style and to the above code itself. It is hosted on my GitHub.

Peter Henderson’s book, Functional Programming: Application and Implementation

Software Engineering in Health Systems Workshop 2016 – Where Engineering is Not Enough

This weekend I had the opportunity to attend the Software Engineering in Health Systems (SEHS) Workshop in Austin, Texas. Through a series of presentations, panels, and group discussions on a range of health care systems topics, two main themes seemed to emerge:

  1. Health care is an information intensive domain in which contemporary software engineering techniques are ill-equipped to operate. Systems engineering techniques seem to handle the problems more easily (though significant work is required).

  2. Though health care professionals and engineers seem to collaborate in these environments, they each fail to fully grasp the perspective of the other; this results in “half-way” solutions which address an immediate need but are brittle and ultimately result in system instability.

The good news, in my opinion, is that the solutions to these problems do not require tremendous innovation in science or engineering. Instead, they require the correct allocation of human resources to the right problems.

In fact, allocation of resources (including humans) is one of the three main views (others are static and dynamic structures) from which you can consider the architecture of a software system; these problems are architectural in nature.

I think the solution to (1) is to bring software engineers to the tables at which systems decisions are made. Kevin Sullivan made this point beautifully in one of the workshop discussions. He indicated that software engineers are trained to think about systems issues already; if they can be part of the discussion about the socio-technical systems in health care then we will begin to see progress.

(2) is a matter of education; interestingly, though, I think the problem cannot be addressed by another classroom course in a standard undergraduate education. In my experience, understanding the complexity of the operations carried out in both health care and software engineering requires a more tacit type of knowledge one can only acquire by actively being part of the environment. Joachim Reiss, VP of R&D at Siemens Health Care, made this point exquisitely clear in his keynote presentation: we need to get engineers into hospitals and health care providers into design labs.

Of course, neither of these problems is trivial; they involve complex social and political structures that make change, especially social change, tremendously difficult. This is simultaneously a great and maddening realization. It is nice to know that we are very capable of effecting massive change in health care systems through social change; however, it is clear from history (and recent events) that social change can be extremely difficult to enact on a large scale. Sadly, as much as I wish we could, we cannot rely strictly on engineering to solve this problem for us.

Finding Twitter Conversations

A Twitter conversation is a series of tweets that usually revolve around an initial tweet. The initial tweet is defined as the root of the conversation, and the rest of the tweets in the conversation would “stem” or “branch” from the root. Finding the root is relatively simple. For example, to find the root of a conversation (or conversations) about “warfarin” a keyword search is performed. Any tweet containing the word “warfarin” would be considered to be a conversation root. The trick is how to define and find “branching”. Branches are tweets that can form a coherent conversation when attached to the root.
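As a minimal sketch of the root-finding step (the tweets and the helper below are invented for illustration, not real data):

```python
# Toy corpus of tweets; in practice these would come from the Twitter API.
tweets = [
    {"id": 1, "user": "alice", "text": "Started warfarin last week, feeling dizzy"},
    {"id": 2, "user": "bob", "text": "@alice how is the new dose working?"},
    {"id": 3, "user": "carol", "text": "Nothing to do with medication here"},
]

def find_roots(tweets, keyword):
    """Treat any tweet containing the keyword (case-insensitive) as a conversation root."""
    return [t for t in tweets if keyword.lower() in t["text"].lower()]

roots = find_roots(tweets, "warfarin")  # only the first tweet qualifies
```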

Natural Language Cues

One technique is to look for linguistic and textual cues that indicate possible branches. Possible initial branches are replies to the root. Replies to the root tweet cannot be found directly; they can only be found through replies to the user who tweeted the root (the root user) via the @ symbol (or other variations). So a @root_user string would indicate a reply to the root user, which could be a reply to the root tweet or a reply to another one of the root user’s tweets. Also, any @ in the root tweet indicates that the root user is replying to another user with the root tweet. These users being replied to could also potentially write tweets related to the root tweet (reply-to tweets). Figure 1 below illustrates the different components of a Twitter conversation.

Figure 1: Example Twitter Conversation

To find whether a reply or a reply-to tweet is a branch of the root tweet (and therefore part of the conversation), further natural language analysis is required. Such tweets must form a coherent conversation with each other and with the root. The key for finding coherent conversations is to find terms that are possibly related that would act as links between the tweets. For example, a drug and an adverse drug event (ADE) could be related terms. Using such links as a starting point, dependency parsing algorithms would do the heavy lifting of connecting individual tweets into a conversation. The shortest dependency path measure [Liu and Chen 2013] (see figure 2 for example) could be used to find terms with a high probability of being related terms.

Figure 2: Example of a Shortest Dependency Path for a part of the Last Tweet in Figure 1

Tweet Networks

Reply and reply-to tweet networks extend beyond the root tweet, as each reply and reply-to tweet to the root can have its own replies and reply-to tweets. These networks are directed graphs in which each node represents a user and each edge is a tweet pointing from the sending user towards the receiving user. Well-connected or central users in such a network can be very important parts of a conversation and deserve special attention. There are two main measures of centrality for each user in a reply and reply-to directed user network. Closeness centrality measures the average shortest path between a user and all the other users in the network. Betweenness centrality measures the number of shortest paths on which a user lies between two other users; it is usually expressed as the probability that a certain user will be on the shortest path between any two users in the network. Both measures point out users that may be central and important parts of a conversation. The tweets of these users could form the main parts of a conversation. Figure 3 illustrates one example of these networks.
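To make the closeness measure concrete, here is a small self-contained sketch (the reply network is invented; a production system would likely use a graph library, and betweenness centrality would be computed along similar lines):

```python
from collections import deque

# A toy reply network: node = user, directed edge = a reply tweet.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice", "dave"],
    "dave": [],
}

def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable user."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def closeness(graph, user):
    """Inverse of the average shortest-path distance to reachable users."""
    dist = bfs_distances(graph, user)
    others = {u: d for u, d in dist.items() if u != user}
    if not others:
        return 0.0
    return len(others) / sum(others.values())

scores = {u: closeness(graph, u) for u in graph}
```

In this toy network alice and carol come out as the most central users, while dave (who replies to no one) scores zero.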

Figure 3: Twitter Users’ Tweet Directed Graph Network


Liu, Xiao and Chen, Hsinchun. “AZDrugMiner: An Information Extraction System for Mining Patient-Reported Adverse Drug Events in Online Patient Forums.” ICSH 2013, LNCS 8040, pp. 134–150 (2013).

Debriefing on the Apple-FBI Debacle: The Aftermath

The Event

As many of you may have heard, in February the FBI requested that Apple make a modification to their systems to allow them to have access to an encrypted iPhone — which swiftly invoked the ire of the security community. Many experts asked why the FBI would even ask such a “ludicrous” and “short-sighted” question. They questioned the FBI’s understanding of basic encryption principles and quipped that the decision must have been made by a politician since no security expert would have made such a request. They further pointed to the past revelations about Snowden’s leaks and how many government associations have recently (and continue to) abuse the powers they have been given in this area.

Many worried that such a request would set a precedent, and even the FBI director admitted that it most likely would.

Apple responded in kind and denied the request. This signaled the start of significant political posturing by both players to garner support for their cause. The security community and many industry leaders quickly sided with Apple.

Ultimately the FBI elected to contract a third party who used an unknown exploit to gain access to the device. Both parties ceased their posturing and stood down.

The Aftermath

Three months later we observe the resulting fallout of this debacle.

Before this event even happened, trust in the U.S. government was near an all-time low, and this event did little to instill further confidence.

Beyond this, some have claimed that Apple “won” the battle as they did not provide the requested modifications; the FBI found another way around that did not require a court decision. This claim may be premature.

What Apple did do was prevent (for the time being) the FBI from having easy access to any iPhone they want in a systemic manner. This is important since the previous abuses of power stemmed from easy and systemic access to data (with or without warrants).

Prevention of such access requires the FBI to treat each case as its own separate issue:

  • Get the physical device
  • Get a warrant to gain access
  • Breach that specific device
  • Search for the evidence they need

The difference is in the difficulty and the case-by-case nature. It is not easy to do this process for 100,000 devices at one time. And it should not be: that is beyond the scope of a single case, and some would say borders on surveillance.

As with most things, there is no black-and-white. There is always a trade-off between privacy and security, and it is not always clear which path should be taken.

What must be considered is that each breach of privacy for sake of security is a one-way transition. By breaching privacy through court-order, or through legislation, a precedent is set. This makes subsequent and wider-scope breaches easier. Alas there is an opportunity cost to our drive for total security. By resisting the FBI, Apple has resisted the erosion of privacy on a systemic level. Access to the device was still gained, but not through systemic means.

Furthermore, this event has heightened the awareness of privacy in the general populace. This can only be taken as a good sign. One should take their privacy seriously and be critical of any institution that intends to infringe on it. Not only did individuals take measures to secure their privacy, but other institutions did too. WhatsApp decided to strengthen their encryption as a direct result of the Apple case.

Unfortunately, the lack of precedent also means that this particular conflict may arise again in the future. And that future may include some event that spurs the government to forgo long-term thinking and rationality for a short-term solution to their immediate problems… Because, that hasn’t happened before.

All in all, the result is favourable to both parties: Apple is enshrined as an advocate for privacy and maintains its position without alienating its consumers, while the FBI was ultimately able to gain access to the device in question.

As for the future? No-one knows.

Requirements Documentation with Gherkin

Blog: Requirements and Gherkin

We have been kicking off our ePRO study (more on this study later) and there are a range of partners in the study. We needed to quickly progress to a level of detail in requirements to ensure common ground even before we get IP and ethics in place.

For this work, I have chosen to use Gherkin, trying to drive some behaviour driven development (BDD). We are not expecting to use Gherkin throughout development — each of the partners has their own methods and tooling, and Gherkin / Cucumber is not among them. Gherkin does, however, provide a clear and accessible convention for structuring requirements as features and scenarios.

Further, as domain specific languages are becoming a popular topic in our lab, it seems fitting to use another one for requirements.

Gherkin Materials

For those not familiar with Gherkin, several people have written about it better than I could here. For example:

  • The source on GitHub:
  • Behat has a great description with solid examples:
  • And there is the cucumber book:

Gherkin and ePRO

We have completed several drafts of the requirements iterating with our partners. The requirements have grown from a half dozen high-level features to 20 pages of more detailed features and scenarios. We have found it useful to organize the features around several actors and the partner / data streams. This works well, given how the project is structured as well as how the ePRO intervention was conceptualized. These requirements now include details such as example data to help the team members. Gherkin’s structure was well suited for this kind of work and the organization by actor was definitely useful.
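For flavour, a feature organized around an actor might look something like this (a made-up illustration in Gherkin, not an excerpt from the actual ePRO requirements):

```gherkin
Feature: Patient reports a daily symptom score
  As a patient enrolled in the ePRO study
  I want to record my symptom scores on my phone
  So that my care team can monitor my condition between visits

  Scenario: Patient submits a valid symptom score
    Given the patient "P-001" is enrolled in the study
    When the patient submits a pain score of 4 on a 0-10 scale
    Then the score is stored with today's date
    And the care team dashboard shows the new entry
```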

Gherkin: Business Readable? Business Writable?

I appreciated a post from Martin Fowler reflecting on whether domain specific languages should be business readable or business writable. In our project, so far, Gherkin has been used in a business readable manner. That is, I have been the primary author and I invited comments from the team. Business owners are providing feedback, questions, and changes, and noticing areas where more detail is needed. My role in our team is often a bridging role – I span clinical and information science – and that seems to be a good fit for using a language like Gherkin.

Next Steps

This work will feed into some user centred design and testing cycles for ePRO that I am very much looking forward to starting.

Thoughts on IBM Watson

I recently attended a presentation by John Boyer, a Chief Architect on IBM’s Watson project. The presentation was about the Watson technology that IBM has developed, and continues to develop. This was the second Watson presentation I have attended; the first was at CASCON 2014, where an IBM VP gave a keynote lecture. I must admit, after both presentations I walked away feeling rather disappointed.

I must preface the remainder of this post by stating that this represents my opinion based on the information I have gathered through these presentations and reading the news. I could be entirely wrong.

The feeling of disappointment stems from what I thought Watson was. Before attending these presentations I had watched Watson play Jeopardy! and read multiple news stories about this fantastic new technology. It felt like Watson was a huge step forward, that it was going to meet all of our analytics needs for all of eternity and make me coffee every morning. I expected these presentations to dazzle and amaze me with new and exciting machine learning techniques or to provide some deep philosophical insight regarding AI. After all, IBM must have some “secret sauce” that makes this tech work, right?

In retrospect, it is not surprising that I was disappointed; I set myself up for it. Anytime someone attributes a “god-like” status to anything, technology or otherwise, they will inevitably be disappointed in some respect.

As far as I can tell, Watson is an aggregation of the numerous machine learning and data mining techniques developed over the last 50 or so years. Of course, IBM has improved and tuned these approaches. However, in the end they are relying on the same basic approaches that the rest of the machine learning and data mining community relies upon. These have been combined to support exactly one task: answering questions.

The entire purpose of Watson is to answer questions based on the (potentially vast) amount of information available. This focus has allowed IBM to concentrate their efforts; there are numerous tasks that Watson can’t do, and never will be able to do, simply because it wasn’t designed for them. Watson isn’t some fantastically brilliant AI using some revolutionary algorithm or technique; it is just a (really big) program, and an excellent example of what contemporary machine learning is capable of at scale.

Watson is, in my opinion, a feat of engineering. Given that it uses current state-of-the-art machine learning techniques, and not some IBM secret sauce, it is an exceptionally well executed machine learning tool. A tremendous amount of engineering effort has gone into bringing together different machine learning approaches and unifying them into a package that can be deployed in a breadth of application domains (health care, finance, business management, security, etc.). The complexity of a single domain, if not managed properly, is often enough to cripple most software projects; Watson takes this complexity in stride and provides value to its customers.

I think the most interesting thing about Watson isn’t Watson itself but its application to different domains. IBM has taken approaches that have existed primarily in academia and made them accessible, on a huge scale, to the world. The value of machine learning and data mining techniques increases as we increase the amount of data processed. Watson will allow us to process tremendous amounts of information, and to really understand the impact of machine learning in different application domains.

Finally, I think Watson provides us with the opportunity to understand the limitations of current machine learning techniques. What problems with these approaches will manifest at scale? At what point is machine learning no longer viable, how will we know when we reach that point, and what will we do after that?

In conclusion, I am disappointed that Watson isn’t some out of this world piece of technology, I really expected to be blown away. Upon reflection, this appears to have been a naive opinion, life lesson: “there is no silver bullet”. However, Watson is an exceptional tool that will change how we approach problems in numerous domains, but it is just that: a tool.