Hana Leehttps://hanalee.info/2017-01-25T00:00:00-06:00The ins and outs of refactoring2017-01-25T00:00:00-06:002017-01-25T00:00:00-06:00Hana Leetag:hanalee.info,2017-01-25:/blog/the-ins-and-outs-of-refactoring.html<p>Advantages of refactoring to more modular code</p><p>The end goal of refactoring is to revise our software to be more reusable, more maintainable,
and better designed. Frequently, that task involves breaking down
large components of code into smaller pieces. For example, one of the most
common refactoring patterns is <a href="https://refactoring.com/catalog/extractMethod.html">Extract
Method</a>, which turns a fragment of code into its own
well-named method. Decomposing our code in this fashion introduces
indirection into the system and supports a modular design. </p>
<p>The immediate benefits may not be obvious: we are introducing more parts to
maintain, and a reader of our code may be forced to "jump around" to
look up what's being referenced. But each part will be smaller and more
self-contained, which actually makes it simpler to navigate in the long run.</p>
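<p>To make this concrete, here is a minimal, hypothetical sketch of Extract Method in plain Ruby (the <code>Invoice</code> class and its method names are invented for illustration, not taken from a real project). Two methods originally repeated the summing logic inline; extracting it gives both callers one small, named piece:</p>

```ruby
# Hypothetical example: #total and #tax both need the sum of prices.
# Instead of repeating the calculation inline, it lives in #subtotal.
class Invoice
  def initialize(prices)
    @prices = prices
  end

  def total
    subtotal + tax
  end

  def tax
    (subtotal * 0.1).round(2)
  end

  private

  # The extracted method: one small piece with a descriptive name.
  def subtotal
    @prices.sum
  end
end
```

<p>A reader of <code>total</code> now sees <code>subtotal + tax</code> rather than two copies of a calculation, and any future change to how the subtotal is computed happens in exactly one place.</p>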
<h2>The advantages of modularity</h2>
<p>Modularity encourages robustness—that is, stability and adaptability to perturbations—in
all sorts of complex systems, not only software. For example,
it is frequently seen in biological networks, where metabolic pathways remain
isolated from one another so that loss of functionality in one does not affect
the functioning of the rest. A bacterium that carries a mutation disabling its
ability to utilize lactose is still able to feed on and digest other sugars,
ensuring its survival.</p>
<p>Modularity also makes it easier to introduce novel behavior into the system
by creating variations in the interactions between existing components, instead of
building a new component from scratch. This type of organization gives an
evolutionary advantage in biology: organisms activate and
control sets of functionally related genes, which can be flexibly combined in
various ways to develop more complex features.</p>
<p>In software, a modular system means that a bug in one portion of the code will
have limited effects that remain isolated instead of bringing the whole
application to a halt. It also means that your code will be easier to test—and
if you follow a test-driven development (TDD) process, your code will naturally
tend to assume a modular organization, with isolatable parts. Adding new
features does not require as much new code to be written
since you can easily reuse existing code as building blocks. </p>
<h2>Identifying problems</h2>
<p>So how do we go about refactoring our code to be more modular? Certain
code smells point to areas that can be decomposed into
smaller components. <a href="https://sourcemaking.com/refactoring/smells/long-method">Long
Method</a> and <a href="https://sourcemaking.com/refactoring/smells/large-class">Large
Class</a> are some obvious
ones, as well as <a href="https://sourcemaking.com/refactoring/smells/duplicate-code">Duplicate
Code</a>. </p>
<p>Another set
of clues to look for is violations of the SOLID principles, particularly the Single
Responsibility Principle (SRP) and the Dependency Inversion Principle (DIP). A modular system
should consist of components that can be mixed and matched in various
combinations: thus, each component should have a clearly defined and singular
function (SRP). Moreover, its dependencies should be organized so that it can
remain as decoupled from other modules as possible (DIP).</p>
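<p>As a hedged illustration (these class names are invented for this post, not drawn from the project discussed below), a small Ruby component that honors both principles might look like this: the printer has exactly one job, formatting output (SRP), and depends only on an injected object that responds to <code>#write</code> rather than on any concrete output class (DIP):</p>

```ruby
# Hypothetical example: ReportPrinter's single responsibility is
# formatting. It depends on a #write interface, not a concrete class.
class ReportPrinter
  def initialize(writer)
    @writer = writer
  end

  def print_total(amount)
    @writer.write("Total: #{amount}")
  end
end

# Any object with a #write method can be injected -- a file wrapper,
# a network client, or this in-memory double used for testing.
class MemoryWriter
  attr_reader :lines

  def initialize
    @lines = []
  end

  def write(line)
    @lines << line
  end
end
```

<p>Because the dependency is injected, <code>ReportPrinter</code> can be exercised in tests with <code>MemoryWriter</code> and deployed with a real writer, without changing a line of the printer itself.</p>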
<p>To demonstrate, I'm going to take a look at some messy, gnarly code that I wrote
for a pair project during my apprenticeship: <a href="https://github.com/hnlee/reviewr/blob/master/app/controllers/projects_controller.rb">a controller class for a Rails
application</a>.</p>
<p>This class immediately stands out as a violation of SRP, since much of the
logic in its methods does not directly relate to the main responsibility of the
controller, which is to determine what views to render in response to HTTP
requests. For example, take a look at the <code>create</code> method:</p>
<div class="highlight"><pre><span></span> <span class="k">def</span> <span class="nf">create</span>
<span class="n">project</span> <span class="o">=</span> <span class="no">Project</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">title</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:title</span><span class="o">]</span><span class="p">,</span>
<span class="ss">link</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:link</span><span class="o">]</span><span class="p">,</span>
<span class="ss">description</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:description</span><span class="o">]</span><span class="p">)</span>
<span class="n">emails</span> <span class="o">=</span> <span class="n">params</span><span class="o">[</span><span class="ss">:emails</span><span class="o">]</span>
<span class="k">if</span> <span class="n">project</span><span class="o">.</span><span class="n">save</span>
<span class="no">ProjectOwner</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="ss">project_id</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="ss">user_id</span><span class="p">:</span> <span class="n">current_user</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="n">emails</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">email</span><span class="o">|</span>
<span class="n">user</span> <span class="o">=</span> <span class="no">User</span><span class="o">.</span><span class="n">find_or_create_by</span><span class="p">(</span><span class="ss">email</span><span class="p">:</span> <span class="n">email</span><span class="p">)</span>
<span class="no">ProjectInvite</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="ss">project_id</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="ss">user_id</span><span class="p">:</span> <span class="n">user</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="no">InviteMailer</span><span class="o">.</span><span class="n">invite_email</span><span class="p">(</span><span class="n">project</span><span class="p">,</span> <span class="n">user</span><span class="p">)</span><span class="o">.</span><span class="n">deliver_now</span>
<span class="k">end</span>
<span class="n">redirect_to</span> <span class="n">user_path</span><span class="p">(</span><span class="n">current_user</span><span class="o">.</span><span class="n">id</span><span class="p">),</span> <span class="p">{</span> <span class="ss">flash</span><span class="p">:</span> <span class="p">{</span> <span class="ss">notice</span><span class="p">:</span> <span class="s2">"Project has been created"</span> <span class="p">}</span> <span class="p">}</span>
<span class="k">else</span>
<span class="n">redirect_to</span> <span class="n">new_project_path</span><span class="p">(</span><span class="ss">user</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:user_id</span><span class="o">]</span><span class="p">),</span> <span class="p">{</span> <span class="ss">flash</span><span class="p">:</span> <span class="p">{</span> <span class="ss">error</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">get_error_message</span> <span class="p">}</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span>
</pre></div>
<p>There is a lot of code in there that relates to creating a new project model and
sending out invite emails, which should be separate from the task of
rendering views.</p>
<p>Moreover, the code that manages sending out invite emails is almost fully
duplicated in another method, a clear code smell:</p>
<div class="highlight"><pre><span></span> <span class="k">def</span> <span class="nf">update</span>
<span class="vi">@project</span> <span class="o">=</span> <span class="no">Project</span><span class="o">.</span><span class="n">find_by_id</span><span class="p">(</span><span class="n">params</span><span class="o">[</span><span class="ss">:id</span><span class="o">]</span><span class="p">)</span>
<span class="vi">@invited_reviewers</span> <span class="o">=</span> <span class="vi">@project</span><span class="o">.</span><span class="n">get_invited_reviewers</span>
<span class="n">emails</span> <span class="o">=</span> <span class="n">params</span><span class="o">[</span><span class="ss">:emails</span><span class="o">]</span>
<span class="k">if</span> <span class="vi">@project</span><span class="o">.</span><span class="n">update_attributes</span><span class="p">(</span><span class="n">project_params</span><span class="p">)</span>
<span class="k">if</span> <span class="n">emails</span>
<span class="n">emails</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">email</span><span class="o">|</span>
<span class="k">if</span> <span class="o">!</span><span class="vi">@invited_reviewers</span><span class="o">.</span><span class="n">find_by</span><span class="p">(</span><span class="ss">email</span><span class="p">:</span> <span class="n">email</span><span class="p">)</span>
<span class="n">user</span> <span class="o">=</span> <span class="no">User</span><span class="o">.</span><span class="n">find_or_create_by</span><span class="p">(</span><span class="ss">email</span><span class="p">:</span> <span class="n">email</span><span class="p">)</span>
<span class="no">ProjectInvite</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="ss">project_id</span><span class="p">:</span> <span class="vi">@project</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="ss">user_id</span><span class="p">:</span> <span class="n">user</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="no">InviteMailer</span><span class="o">.</span><span class="n">invite_email</span><span class="p">(</span><span class="vi">@project</span><span class="p">,</span> <span class="n">user</span><span class="p">)</span><span class="o">.</span><span class="n">deliver_now</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="o">...</span>
<span class="k">end</span>
</pre></div>
<p>Looking for indicators of bad design and thinking about where the natural
boundaries lie in your software help identify places where refactoring is
needed.</p>
<h2>Refactoring safely</h2>
<p>At this stage, it may seem tempting to dive right in and start cutting and
pasting, but there's a process to refactoring that ensures you don't change the
behavior of the application while making your changes. (That's in the very
definition of refactoring, after all!) Martin Fowler, when
describing the Extract Method refactoring pattern, outlines the following steps:</p>
<ol>
<li>Create a new method and name it after what the extracted code does.</li>
<li>Copy (not cut or move) the extracted code to the new method.</li>
<li>Look for variables that are local in scope to the old method.</li>
<li>If they are only used in the extracted code, then declare them as temporary
variables in the new method.</li>
<li>If they are used in other parts of the old method, then pass them in as
parameters to the new method.</li>
<li>Replace the extracted code in the old method with a call to the new method.</li>
</ol>
<p>Of course, implicit in these steps is the assumption that you will run your test
suite after every step to make sure that the behavior of the application has not
changed.</p>
<p>This algorithm for refactoring reminds me of a phenomenon in evolution: a gene
undergoes duplication, interactions with the original gene switch over to the
new copy, and then the first gene gradually becomes nonfunctional, at which
point it is called a pseudogene. The main difference from refactoring is that
the evolutionary process takes many, many generations—a lot slower than simply
manipulating text in an editor!</p>
<p>Applying these steps to my example above, first I create a new method and
copy over the code I want to extract:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">email_invited_reviewers</span>
<span class="n">emails</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">email</span><span class="o">|</span>
<span class="n">user</span> <span class="o">=</span> <span class="no">User</span><span class="o">.</span><span class="n">find_or_create_by</span><span class="p">(</span><span class="ss">email</span><span class="p">:</span> <span class="n">email</span><span class="p">)</span>
<span class="no">ProjectInvite</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="ss">project_id</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="ss">user_id</span><span class="p">:</span> <span class="n">user</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="no">InviteMailer</span><span class="o">.</span><span class="n">invite_email</span><span class="p">(</span><span class="n">project</span><span class="p">,</span> <span class="n">user</span><span class="p">)</span><span class="o">.</span><span class="n">deliver_now</span>
<span class="k">end</span>
<span class="k">end</span>
</pre></div>
<p>Next, I look at the variables. Most of the variables aren't used
outside the <code>emails.each</code> block, but <code>emails</code> itself is defined in the original
method from <code>params</code>. So it should probably be passed in as a parameter to the
new method.</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">email_invited_reviewers</span><span class="p">(</span><span class="n">emails</span><span class="p">)</span>
<span class="n">emails</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">email</span><span class="o">|</span>
<span class="n">user</span> <span class="o">=</span> <span class="no">User</span><span class="o">.</span><span class="n">find_or_create_by</span><span class="p">(</span><span class="ss">email</span><span class="p">:</span> <span class="n">email</span><span class="p">)</span>
<span class="no">ProjectInvite</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="ss">project_id</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="ss">user_id</span><span class="p">:</span> <span class="n">user</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="no">InviteMailer</span><span class="o">.</span><span class="n">invite_email</span><span class="p">(</span><span class="n">project</span><span class="p">,</span> <span class="n">user</span><span class="p">)</span><span class="o">.</span><span class="n">deliver_now</span>
<span class="k">end</span>
<span class="k">end</span>
</pre></div>
<p>Now I can return to the <code>create</code> method and replace that code with a call to
<code>email_invited_reviewers</code>:</p>
<div class="highlight"><pre><span></span> <span class="k">def</span> <span class="nf">create</span>
<span class="n">project</span> <span class="o">=</span> <span class="no">Project</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">title</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:title</span><span class="o">]</span><span class="p">,</span>
<span class="ss">link</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:link</span><span class="o">]</span><span class="p">,</span>
<span class="ss">description</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:description</span><span class="o">]</span><span class="p">)</span>
<span class="n">emails</span> <span class="o">=</span> <span class="n">params</span><span class="o">[</span><span class="ss">:emails</span><span class="o">]</span>
<span class="k">if</span> <span class="n">project</span><span class="o">.</span><span class="n">save</span>
<span class="no">ProjectOwner</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="ss">project_id</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="ss">user_id</span><span class="p">:</span> <span class="n">current_user</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
<span class="n">email_invited_reviewers</span><span class="p">(</span><span class="n">emails</span><span class="p">)</span>
<span class="n">redirect_to</span> <span class="n">user_path</span><span class="p">(</span><span class="n">current_user</span><span class="o">.</span><span class="n">id</span><span class="p">),</span> <span class="p">{</span> <span class="ss">flash</span><span class="p">:</span> <span class="p">{</span> <span class="ss">notice</span><span class="p">:</span> <span class="s2">"Project has been created"</span> <span class="p">}</span> <span class="p">}</span>
<span class="k">else</span>
<span class="n">redirect_to</span> <span class="n">new_project_path</span><span class="p">(</span><span class="ss">user</span><span class="p">:</span> <span class="n">project_params</span><span class="o">[</span><span class="ss">:user_id</span><span class="o">]</span><span class="p">),</span> <span class="p">{</span> <span class="ss">flash</span><span class="p">:</span> <span class="p">{</span> <span class="ss">error</span><span class="p">:</span> <span class="n">project</span><span class="o">.</span><span class="n">get_error_message</span> <span class="p">}</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span>
</pre></div>
<p>I can even go to the <code>update</code> method and replace it with a similar call. The
only wrinkle is that there is an extra conditional in the <code>update</code> method to
check whether the person has already been sent an email invite. One way to
handle that is by extracting another method to filter <code>emails</code> before passing it to
<code>email_invited_reviewers</code>.</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">filter_new_reviewers</span><span class="p">(</span><span class="n">emails</span><span class="p">,</span> <span class="n">invited_reviewers</span><span class="p">)</span>
<span class="n">emails</span><span class="o">.</span><span class="n">select</span><span class="p">{</span><span class="o">|</span><span class="n">email</span><span class="o">|</span> <span class="o">!</span><span class="n">invited_reviewers</span><span class="o">.</span><span class="n">find_by</span><span class="p">(</span><span class="ss">email</span><span class="p">:</span> <span class="n">email</span><span class="p">)}</span>
<span class="k">end</span>
</pre></div>
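<p>Outside of a Rails app, the behavior of that filter can be sketched with plain arrays (this standalone version is only illustrative; the real method above receives an ActiveRecord relation and queries it with <code>find_by</code>, not an array):</p>

```ruby
# Standalone sketch: keep only the addresses that have not already
# been invited. Array#reject drops every email found in the
# invited_emails list.
def filter_new_reviewers(emails, invited_emails)
  emails.reject { |email| invited_emails.include?(email) }
end
```

<p>Like the extracted controller method, this function has a single, clearly named job, which also makes it trivial to unit test in isolation.</p>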
<p>Then we can replace the code in <code>update</code> with calls to both methods:</p>
<div class="highlight"><pre><span></span> <span class="k">def</span> <span class="nf">update</span>
<span class="vi">@project</span> <span class="o">=</span> <span class="no">Project</span><span class="o">.</span><span class="n">find_by_id</span><span class="p">(</span><span class="n">params</span><span class="o">[</span><span class="ss">:id</span><span class="o">]</span><span class="p">)</span>
<span class="vi">@invited_reviewers</span> <span class="o">=</span> <span class="vi">@project</span><span class="o">.</span><span class="n">get_invited_reviewers</span>
<span class="n">emails</span> <span class="o">=</span> <span class="n">params</span><span class="o">[</span><span class="ss">:emails</span><span class="o">]</span>
<span class="k">if</span> <span class="vi">@project</span><span class="o">.</span><span class="n">update_attributes</span><span class="p">(</span><span class="n">project_params</span><span class="p">)</span>
<span class="k">if</span> <span class="n">emails</span>
<span class="n">filtered_emails</span> <span class="o">=</span> <span class="n">filter_new_reviewers</span><span class="p">(</span><span class="n">emails</span><span class="p">,</span> <span class="vi">@invited_reviewers</span><span class="p">)</span>
<span class="n">email_invited_reviewers</span><span class="p">(</span><span class="n">filtered_emails</span><span class="p">)</span>
<span class="k">end</span>
<span class="o">...</span>
<span class="k">end</span>
</pre></div>
<p>There's a lot more refactoring that could be done to improve the modularity of
this code. For example, the <code>email_invited_reviewers</code> and <code>filter_new_reviewers</code>
methods probably belong in new classes, one to handle email actions and another
to handle looking up data from models. Both the <code>create</code> and <code>update</code> controller
methods I've shown above could also use a few more method extractions, so that the
responsibility for creating new models or updating existing ones does not lie
with the controller. And that's not even getting into the rest of the methods
in this class...</p>
<p>That's one of the reasons that refactoring should be a continual process
throughout development. Ideally, we ought to refactor during every TDD cycle
("Red, Green, Refactor"), but if that doesn't happen, we should do it when we first notice a
problem, instead of leaving it for some future time that inevitably never comes. That's
a lesson I learned quite personally during my apprenticeship!</p>
<h2>Takeaways</h2>
<ul>
<li>Refactoring often involves breaking down large chunks of code into smaller
pieces. That improves design because it promotes modularity.</li>
<li>Modular systems are more robust and adaptable to change in general, not
just in software.</li>
<li>Identify areas for refactoring through code smells and violations of SOLID
principles.</li>
<li>The process of refactoring requires that the application behavior remains
unchanged. Take it step by step, and don't forget to run your test suite
often.</li>
</ul>Pairing tour2017-01-12T00:00:00-06:002017-01-12T00:00:00-06:00Hana Leetag:hanalee.info,2017-01-12:/blog/pairing-tour.html<p>Summary of my apprenticeship pairing tour</p><p>The penultimate stage of the resident apprenticeship at 8th Light is a pairing
tour, where I got to spend time pairing with the crafters who will sit on my
Review Board (i.e. the people who decide whether I get to "graduate" from my
apprenticeship). I spent a day with each crafter and got exposed to a variety of
different clients, languages and frameworks, project management tools, and work
styles, all in under two weeks. Here are some notes and thoughts on how it went.</p>
<p><strong>Eric M.</strong>: The day was spent updating the front-end Javascript library used for
UI elements to use ES6, with Webpack and Karma to run builds and tests, rather than
Coffeescript with Grunt. Since my previous experience with Javascript had been
with ES5, I got a taste of how ES6 and Coffeescript syntax differed. It also
showed me what a well-constructed Javascript library should look
like, with good unit test coverage and a modular architecture that lends itself
to reuse.</p>
<p><strong>Kristin</strong>: I had another day spent with Javascript, hunting down a bug that turned out
to be especially difficult to trace because of duplication that existed due to
the feature being in a transition phase. We kept inserting <code>console.log()</code>
statements and seeing nothing in the inspector until we finally realized that
the code being called existed in a completely different portion of the file
tree! Once we figured out what was going on, we fixed the bug, then fixed
another bug that we discovered while testing the functionality in the browser,
and finally made a pull request with the changes. It also sparked some good
software design discussions about why the duplication existed and what the best way would be to
resolve the confusion it created.</p>
<p><strong>Doug</strong>: Pairing with Doug was a bit unusual, since he is one of the two Managing
Directors of the Chicago office. But it was rather fascinating to shadow him
during his day and find out what his many responsibilities look like.
Some meetings were with other management making high-level strategic
or staffing decisions, and other meetings were with the teams that
reported to Doug. Up until then, I hadn't had many opportunities to see a really
good manager in action, so I was struck by the fact that the crafters
who spoke with Doug all had a great deal of trust in him and were comfortable
being honest about the problems they were facing in their work. I think this
trust is able to exist when the manager shows trust in the team first. </p>
<p><strong>Vincent</strong>: Vincent is one of my mentors, so I already knew him quite well, but I
had never seen any of his client work before. We spent the day trying to lay the
groundwork for "A/B testing" (quotation marks because the test group would be
selected nonrandomly and their identities known) by exposing endpoints that
would be necessary for identifying whether a user belonged to the test or
control group. Due to the existing infrastructure, the service that would have
the test feature could not talk directly to the database containing the user
information, so we had to figure out how to communicate that data via an
intermediary service that connected to a Grape API that connected to the database.
(If that sounded complicated, it definitely was.)</p>
<p><strong>Eric S.</strong>: Eric is my other mentor, and since his usual client has a lot of
security procedures, I paired with him on a day when he didn't have to be on
site. Since he's the Director of Training, I ended up sitting in on a few
meetings that related to the various training services that 8th Light provides
to businesses and even got to provide some feedback on a curriculum being
developed. Towards the end of the day, he walked me through one of his side projects,
a browser version of Space Invaders implemented in F#, and discussed how to
test a game loop in a functional paradigm. I didn't know any F#, so it was an
interesting chance to look at an explicitly typed functional language, as well
as a good demonstration of how to make things that are difficult to test more
testable.</p>
<p><strong>Nicole</strong>: I was supposed to pair with Zack originally, but because he is in
between clients, he decided that it would be more productive for me to pair with Nicole, his former
apprentice. I had paired with Nicole before, when she was still an apprentice,
on our feedback rating app, so it was good to be working with her again. We
worked on an internal 8th Light application that was written in Clojure; it was
also good to be working in Clojure again. We wrote a validator for a new
field added to a form and spent quite a lot of time hunting down every test that
used that form in some way to make sure that those tests now worked with the new
validation requirement. The whole test suite passed, but then the feature didn't
work in the browser; the bug required some outside assistance to track down
and turned out to be due to a parameter not being passed to a third-party
library.</p>
<p><strong>Rob</strong>: The first story of the day was adding a logger to an API managing
document uploads. The code was all in C# and had to be edited in Visual Studio
through a VM, which was a first for me. We used a third-party library to do the
logging, but it took a lot of reading and experimenting to get it installed and
then figure out where it needed to be injected. But we got it to log all API
requests and responses and finished the story pretty quickly. Then we waded
through the details of a few other stories and realized that they either had
already been addressed or needed further consultation with business
stakeholders. The last story we worked on was finishing the setup of a Flask API
that would eventually hold endpoints for a new service. That was getting close
to the end of the day, but I did get to write some Python and put in a passing
test for the root path route.</p>
<p><strong>Lisa</strong>: I paired with Lisa today, as the last stop on my pairing tour. She is
working on a greenfield app, which has a React front-end and a Rails backend.
The morning was spent on doing a code review of a pull request and addressing
comments on a pull request she had made before, and then the afternoon was spent
pairing with Jerome, another 8th Light crafter on the same project. I found React to be really
interesting, since it makes some aspects of DOM manipulation seem a lot easier
than with plain Javascript and jQuery, so now I'm trying to think of ways
that I can learn more about the framework in a future side project. Lisa also
had a whole host of useful tips---from making more informative pull requests to using
<code>git checkout -</code> to switch to the last branch---that she imparted during the
day, which I plan to put into use.</p>
<p>Overall, the pairing tour was a great experience, allowing me to see what a
crafter actually does on a day-to-day basis. Getting to see so many new code
bases, all with different languages and frameworks, was also really
illuminating; it shed new light on how a lot of decisions (good and bad) about
software architecture get made as well as the challenges of working with legacy
code. I'm in the middle of reading <em>Working Effectively with Legacy Code</em>, and I
think some of the content will make more sense to me now that I've actually
experienced trying to modify a truly large existing code base. Everything that
I've worked on up until now has been quite small and manageable,
with no more than three contributors; it really is worlds apart from
a piece of software that multiple teams have worked on, often for several years.</p>
<p>I also got to see more than just code. It was really educational to see how
different developer teams work together: how they conduct their standups and
IPMs, what their development workflows are, how they divvy up story cards and
tasks, and even just how they consult one another individually when they have
questions. Reading about agile processes in a book is not really the same as
seeing how they are conducted in real life. Each organization, as well as each
individual, ends up making compromises, and it's interesting to see how the same
tradeoffs recur over and over again, although they are solved in different ways.</p>
<p>Tomorrow begins the final stage: challenges! I have no idea what these will be
like since they generally keep them a secret from apprentices. What I do know is
that the next two weeks will probably be intense. Wish me luck!</p>Testing and data analysis in Python2016-12-21T00:00:00-06:002016-12-21T00:00:00-06:00Hana Leetag:hanalee.info,2016-12-21:/blog/testing-and-data-analysis-in-python.html<p>Incorporating testing into my data analysis workflow</p><p>Something that occurred to me more frequently than I would like to admit while I
was writing code to analyze data back in graduate school was that I would start a
computationally intensive process, only to find that it would terminate at the
penultimate step due to a simple bug in my code. At the time, I knew nothing
about testing frameworks or test-driven development. I had a variety of
approaches to dealing with this problem:</p>
<ul>
<li>Go through statements one by one in the REPL (the ones that I could run
quickly, anyway)</li>
<li>Make sure I could check on interim progress through standard output or output files</li>
<li>Split up my scripts into smaller units</li>
</ul>
<p>None of these really substitute for the confidence you have with test coverage,
however. Since I've returned to wrangling with data on my current client
project, I've sought to incorporate testing into my workflow.</p>
<p>Most data scientists endorse the use of <a href="http://jupyter.org/">notebooks</a> as a form of reproducible
research. Indeed, it's a good way to explore a data set and share your analysis.
I've been treating what I do in notebooks as a spike; it's a good place for one-off
code and getting familiar with new APIs or libraries.</p>
<p>But when I find myself repeating a block of code over and over again, I move to the text editor and start writing tests for a reusable
function. These functions then assemble into modules that I can import as necessary to carry out any lengthy processes or computations. I'm not quite at the stage when the final script I run only makes calls to test-covered functions, but I'm getting there. And certainly, the bugs I'm experiencing now tend to happen in the parts of the script that use untested functions.</p>
<p>I use <a href="http://doc.pytest.org/en/latest/">pytest</a> to run my tests. I find the
output easier to read and interpret than the built-in
<a href="https://docs.python.org/3/library/unittest.html">unittest</a>. Other people
recommend <a href="https://nose2.readthedocs.io/en/latest/index.html">nose2</a>, which I
haven't tried yet.</p>
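<p>As a minimal sketch of what this workflow looks like (the function and file
names here are hypothetical, not from my actual project), pytest needs nothing
more than <code>test_</code>-prefixed functions containing plain <code>assert</code> statements:</p>

```python
# clean_data.py -- a hypothetical reusable function extracted from a notebook
def normalize_counts(counts):
    """Scale a list of raw counts so they sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return [count / total for count in counts]


# test_clean_data.py -- pytest discovers test_* functions automatically and
# rewrites the bare assert statements into readable failure messages
def test_normalize_counts_sums_to_one():
    assert normalize_counts([2, 3, 5]) == [0.2, 0.3, 0.5]


def test_normalize_counts_preserves_proportions():
    assert normalize_counts([1, 1]) == [0.5, 0.5]
```

<p>Running <code>pytest</code> in the project directory picks up any file named
<code>test_*.py</code>, so the tested functions can live in importable modules while
the tests stay alongside them.</p>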
<p>At the current stage of my project, I've been using the APIs from various social
media platforms to collect data. While testing the functions I use to call these
APIs and parse the JSON responses, I try to avoid directly interacting with
those APIs (especially because they might have rate limits!) and create small
mock APIs to pass to my functions instead. I found one library that was
particularly good for mocking out HTTP requests and JSON responses, called
appropriately enough <a href="https://github.com/getsentry/responses">responses</a>. Note
that if you're having trouble with the mocks, make sure you have the
<code>match_querystring</code> boolean set to <code>True</code>. Otherwise, any query parameters you add to
the URI of your request won't be matched. (That's not mentioned anywhere in the
documentation, and I had to go into the source code to figure out that the
parameter existed.)</p>
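<p>The underlying idea, stripped of the <code>responses</code> library itself, is to
substitute a canned JSON payload for the real network call. Here's a minimal
sketch of that principle using only the standard library's
<code>unittest.mock</code> (the endpoint and parsing function are made up for
illustration):</p>

```python
from unittest import mock

API_URL = "https://api.example.com/posts"  # hypothetical endpoint


def fetch_post_count(session, user):
    """Call the (hypothetical) API and parse its JSON response."""
    response = session.get(API_URL, params={"user": user})
    return len(response.json()["posts"])


def test_fetch_post_count_parses_json():
    # Stands in for a requests.Session, so no real HTTP request is made
    # (and no rate limit is consumed).
    fake_session = mock.Mock()
    fake_session.get.return_value.json.return_value = {
        "posts": [{"id": 1}, {"id": 2}]
    }
    assert fetch_post_count(fake_session, "hana") == 2
    fake_session.get.assert_called_once_with(API_URL, params={"user": "hana"})
```

<p>A library like <code>responses</code> does the same substitution one level lower,
intercepting the HTTP layer itself, which lets you keep calling the real client
code unchanged.</p>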
<p>I haven't heard much about testing frameworks or TDD in data science circles. People do
talk a lot about validation, but that means validating the output rather than
the code itself. I think it's
because data science doesn't focus so much on building software; most of the
code we write is procedural and not necessarily extensively reused. Since we
may work primarily or only with third-party libraries (which hopefully <em>are</em>
already
well-tested before release), the need to test the code we write ourselves probably does not seem as
urgent. Nonetheless, taking the time to write tests, even for a simple series of statements that you
don't plan to use again, can help you be confident that the scripts you're
running won't fail on some easily preventable bug. </p>Feedback from my mini review board2016-12-20T00:00:00-06:002016-12-20T00:00:00-06:00Hana Leetag:hanalee.info,2016-12-20:/blog/feedback-from-my-mini-review-board.html<p>Current status and areas for improvement</p><p>A few weeks ago, I had a mini review board to look over the code I contributed
to the pair project that I worked on with a fellow apprentice. After
she rolled off the project, I spent an additional two
weeks working on it solo, finishing up some stories in the backlog and adding
some minor features. Then a group of six crafters, including my two mentors, as
well as the two mentors of the apprentice with whom I paired, reviewed my
code and gave me feedback on the project as well as on the status of my
apprenticeship overall. It's called a "mini" review board because it's a much
smaller preview of the review board that will ultimately review the work I
do during the last two weeks of my apprenticeship to decide whether I will move
on to be a software crafter.</p>
<p>Most of the feedback focused on object-oriented design. The app we built
more or less followed the architecture of a traditional Rails app, with the
accompanying weaknesses. In particular, there was a lot of refactoring that I
could do on the controllers, extracting out the logic into <a href="http://multithreaded.stitchfix.com/blog/2015/06/02/anatomy-of-service-objects-in-rails/">interactors or
service objects</a> that could then be tested in isolation. Some of the logic in the
views could be put in interactors as well, which sounded like a good move
since a fair bit of functionality there is currently covered only by
acceptance tests rather than by unit tests.</p>
<p>My mentor paired with me later to show me how to safely refactor and
extract interactors from my controllers. Another advantage of this approach is
that it also allows me to do a bit of dependency inversion and not require
manipulation of ActiveRecord models in my unit tests. It's not quite the same
as the <a href="http://martinfowler.com/eaaCatalog/repository.html">Repository pattern</a>, but it achieves a similar goal.</p>
<p>On similar lines, the Javascript in our app currently barely qualifies as
"object-oriented". There are constructors and objects are initialized, but in
reality, the classes are really just namespaces for related functions. One of
the key suggestions I received was to refactor the Javascript to use
encapsulation and implement some of the object-oriented design principles
that I've spent the past several months reading about.</p>
<p>To be honest, we hadn't spent much time discussing the app's architecture while
building the project. We made decisions about the database schema and briefly
discussed whether to implement the Repository pattern (which we attempted then
abandoned because we ran out of time), but otherwise, we unquestioningly
followed the architecture that Rails sets out by default. But of course, when one is a professional consultant, one doesn't want to simply demonstrate the
ability to follow Rails tutorials to the letter. As one of the crafters on my
review board put it, this app doesn't really show anything of my personality
as a developer. So my task is to work on refactoring the code, focusing
particularly on one of the controllers and the Javascript, to get practice in
making those design decisions about the architecture of the app. I've found this
blog post, which my mentor sent to me, particularly useful for thinking about
the bigger picture: <a href="https://8thlight.com/blog/uncle-bob/2012/08/13/the-clean-architecture.html">Clean
Architecture</a>.</p>
<p>I would like to say that I've made a lot of progress on this task in the weeks
since my mini review board was held, but alas, I've had my time taken up with
a client project instead. I'm quite excited about this project actually,
which involves data science and machine learning, but it's left me with no
opportunities to wrap up loose ends. But I'm going to try to set aside a little time
every day to work on implementing the suggestions I received.</p>
<p>The pair project was my introduction to Ruby and Rails. I have to admit that
I'm not particularly thrilled by Ruby, and Rails is such a behemoth of a
framework that it makes Django look quite small by comparison. Ruby
looks deceptively similar to Python at first, but its underlying
philosophy is almost antithetical to the Pythonic style of coding.
Nonetheless, so many modern web applications are built in Ruby and Rails that
I appreciate the chance to gain experience with these tools, even if they
wouldn't be my first choice on a personal project. I think it would be
worthwhile to fiddle around with Ruby outside the Rails context. I mean,
Ruby will still be Ruby, with its apparent belief that having multiple
ways of saying the same thing actually makes developers "happy", but at
least it might give me a chance to see the charms of the language beyond the
limited scope of a Rails project.</p>
<p>This pair project also introduced me to Javascript and all the pains of trying
to test user interactions. I can't say that I <em>enjoyed</em> writing Javascript,
but I did find it interesting to try to get it under test. One of the stories
that I worked on solo after my fellow apprentice left the project was to do a timeboxed spike to see if I could get an estimate on how much work it would be to build an <a href="https://ionicframework.com/">Ionic</a> hybrid mobile app to work with the Rails backend.
Building an API for the app in Rails didn't seem too difficult actually, but Ionic is built on top of Angular, which in turn uses Typescript, which posed quite a learning curve. (It didn't help that there was a major breaking change between Ionic 1 and 2, to match the difference between Angular 1 and 2!) I barely managed to get a Google login screen working before I ran out of time on my spike.
Nonetheless, that's definitely an area that I would like to explore further in
the future. I haven't built a single mobile app yet, and that's something I
would like to learn how to do.</p>Links, the bash and functional programming edition2016-12-02T00:00:00-06:002016-12-02T00:00:00-06:00Hana Leetag:hanalee.info,2016-12-02:/blog/links-the-bash-and-functional-programming-edition.html<p>Useful links related to bash and functional programming</p><p>For someone who once had to exclusively use Linux at work for several years, my
knowledge of bash is extremely mediocre. Luckily, there is a wide range of
references and tutorials out there.</p>
<p><a href="http://explainshell.com/">Explain Shell</a>: Type in any command line and have it
broken down and explained, piece by piece.</p>
<p><a href="https://github.com/tldr-pages/tldr">tldr</a>: A more readable version of <code>man</code>
pages for commonly used commands.</p>
<p><a href="https://learnpythonthehardway.org/book/appendixa.html">Command Line Crash
Course</a>: Mostly covers
the basics that I already knew, but it's still a good place to get started. </p>
<p><a href="https://www.humblebundle.com/books/unix-book-bundle">Humble Bundle Unix Books</a>:
This bundle is still active for about four more days and includes pretty much
all the O'Reilly guides you would ever need. I've been meaning to purchase the
one on <code>sed</code> and <code>awk</code> for a while now since I don't know how to use either.</p>
<p>And now for the functional programming half...</p>
<p><a href="https://fsharpforfunandprofit.com/fppatterns/">Functional Programming Design
Patterns</a>: Slides and video for a
talk on functional programming design patterns, which was linked in the <a href="http://8thlight.com">8th
Light</a> company Slack. I didn't entirely digest a lot of the
principles he describes in the talk, so it's definitely a link I hope to
revisit and ponder more deeply.</p>
<p><a href="http://blog.cleancoder.com/uncle-bob/2014/11/24/FPvsOO.html">OO vs FP</a>: Blog
post by Uncle Bob, in part a reaction to the above. It lays out the case that
object-oriented programming and functional programming are not such
fundamentally different paradigms and that the same design patterns used in
object-oriented programming can be used in functional programming as well.</p>
<p><a href="https://speakerdeck.com/trptcolin/adopting-fp-the-good-the-familiar-and-the-unknown">Adopting FP: The Good, the Familiar, the
Unknown</a>:
Talk given by our CTO at the inaugural <a href="http://www.meetup.com/ChicagoSC/">Chicago Software Craftsmanship
Meetup</a>. Also makes a similar case that a lot
of the design principles and patterns from object-oriented programming may be
transferrable or have their equivalents in functional programming.</p>Links, the unthemed edition2016-11-18T00:00:00-06:002016-11-18T00:00:00-06:00Hana Leetag:hanalee.info,2016-11-18:/blog/links-the-unthemed-edition.html<p>Useful miscellaneous links</p><p>This time, my list follows no particular theme, but here are some links that
I've collected lately and found interesting:</p>
<p><a href="http://semver.org/">Semantic Versioning</a>: Guide to how semantic versioning
works.</p>
<p><a href="https://medium.freecodecamp.com/a-study-plan-to-cure-javascript-fatigue-8ad3a54f2eb1#.n4fw8yqll">A Study Plan to Cure Javascript Fatigue</a>:
This post provides a pretty good guide for how to
learn Javascript, starting with React, ES6, Redux, and
GraphQL. Dealing with the front-end interactivity on my current
project has made me realize how useful Javascript can be, and I think
the resources in this post will come in useful for any future web development
projects.</p>
<p><a href="https://medium.com/@olivercameron/20-weird-wonderful-datasets-for-machine-learning-c70fc89b73d5#.tid0gfng7">20 Weird & Wonderful Datasets for Machine Learning</a>: Always worth having links to any useful open data sets. I'm particularly interested in the mushroom one, which could make for a fun side project.</p>
<p><a href="https://git-scm.com/docs/git-reflog"><code>git reflog</code></a>: I made a mistake during
rebasing and thought I had lost all the changes I had made on the branch. My
mentor suggested that I check <code>git reflog</code>, which completely saved me from the
horrible prospect of having to repeat a good two or three hours of work.
I've resolved to be more vigilant about rebasing in the future, but in the meantime,
I am glad that there's always a backup somewhere in <code>git</code>, as long as you
remember to commit.</p>Testing AJAX calls and DOM manipulation2016-11-03T00:00:00-05:002016-11-03T00:00:00-05:00Hana Leetag:hanalee.info,2016-11-03:/blog/testing-ajax-calls-and-dom-manipulation.html<p>Challenges encountered while testing Javascript in a Rails application and how we solved them</p><p>I'm currently working on a pairing project with <a
href="http://github.com/NicoleCarpenter">Nicole</a>, a fellow <a
href="http://8thlight.com">8th Light</a> apprentice, where we are developing an
internal tool intended for rating how well people review projects on three
criteria: whether they are kind, specific, and actionable. (While the obvious
use case is for evaluating the helpfulness of code reviews, it could conceivably
be used for any endeavor where feedback is solicited, including blog posts or
event planning.) We are building the application in Ruby and Rails, but there
are several features that require Javascript to mediate user interactions. Most
of these are fairly simple uses of jQuery and AJAX to display a form or error
message dynamically on the same page without redirection or reloading. But
testing our Javascript turned out to be more difficult than we anticipated.</p>
<p>We used <a href="http://jasmine.github.io/">Jasmine</a> as our testing framework, but Jasmine by itself is not
sufficient to cover asynchronous AJAX requests and DOM manipulation. We spent quite
a lot of time navigating Stack Overflow responses and asking for help from
coworkers, which led us to wrestle with <a
href="https://github.com/jasmine/jasmine-ajax">jasmine-ajax</a> and <a
href="https://github.com/velesin/jasmine-jquery">jasmine-jquery</a>. Both of
these plug-ins were significantly overpowered for what we were trying to test,
and we got nowhere while trying to figure out how to use them correctly.</p>
<p>Luckily, <a href="http://paytonrules.com">my mentor</a> got back from vacation, and I was able to pick his brain for
assistance. He suggested using <a href="http://sinonjs.org/">sinon</a> and <a
href="https://github.com/searls/jasmine-fixture">jasmine-fixture</a>, which were
both fairly simple libraries with documentation that was easy to decipher.</p>
<p>The first step was to separate the AJAX calls themselves from the DOM
manipulation. I set up my tests using <code>sinon</code>'s handy <code>fakeServer</code> and <code>spy</code>
classes. I told the fake server to give a 200 status response to the type of
AJAX request I was testing and a spy to watch that the AJAX call was made and
successfully completed.</p>
<p>Now to test the DOM manipulation. I had tried to set up HTML fixture files while
experimenting with <code>jasmine-jquery</code>, but that seemed like a lot of redundant code
for testing functions that were pretty much just displaying or hiding elements
on a page. The <code>affix()</code> function from <code>jasmine-fixture</code> was much simpler,
allowing you to quickly set up the elements you needed to test, while also
taking care of cleaning them up after the test was run. I could create an
element, call my DOM manipulation function, then check that the necessary
change had happened to the element.</p>
<p>These two steps allowed me to create a toolkit of reusable functions for making AJAX GET and
POST requests and for displaying, hiding, and replacing parts of the DOM. Then I
could easily write unit tests for the specific user interactions that combined those
functions. Having the fake server here was especially useful because you could
explicitly test that an element was loading the message body of the response
received from the server after an AJAX request.</p>
<p>We are using <a
href="https://github.com/jnicklas/capybara">Capybara</a> to handle our
acceptance tests. While Capybara is very powerful, you do need to be thoughtful
about how you write any tests that cover functionality involving asynchronous
requests. One <a
href="https://robots.thoughtbot.com/write-reliable-asynchronous-integration-tests-with-capybara">blog
post</a> provides a good summary of the points to keep in mind. However,
something very simple that no blog post or Stack Overflow answer seemed to
explicitly cover is that you need to include <code>:js => true</code> in your test blocks
for RSpec to realize it needs to use the Javascript driver.</p>
<p>Speaking of which, it probably helps to use a headless browser instead of the
Capybara default for the Javascript driver. I ended up going with PhantomJS
through the <a
href="https://github.com/teampoltergeist/poltergeist">Poltergeist</a> gem, since we
are using <a href="https://travis-ci.org/">Travis CI</a> for continuous
integration, and it has PhantomJS already installed.</p>
<p>Finally, note that if your acceptance test involves any interaction with hidden elements
on your page, Capybara may require you to explicitly set
<code>Capybara.ignore_hidden_elements = false</code> (depending on your version).</p>
<p>The source code for our project, still in progress, is available on
<a href="https://github.com/hnlee/reviewr">Github</a>, and a demo is deployed on
<a href="http://reviewr-app.herokuapp.com/">Heroku</a>.</p>Links, the Clojure edition2016-10-12T00:00:00-05:002016-10-12T00:00:00-05:00Hana Leetag:hanalee.info,2016-10-12:/blog/links-the-clojure-edition.html<p>Useful links related to Clojure</p><p>I'm wrapping up my current apprenticeship project, a simple HTTP server, which
has exposed me to Clojure so that I have a functional programming language in my
toolkit alongside an object-oriented one. I do find functional programming far
more intuitive than object-oriented programming: a lot of the concepts have close parallels in mathematics, and
at least on a naive level, it's much more similar to writing
scripts in an imperative paradigm, which is where most of my experience with
programming lies.</p>
<p>In addition to being an example of a functional programming language, Clojure
also belongs to the Lisp family. So yes, it has a lot of nested parentheses, as
well as Polish (prefix) notation. I thought the latter would be hard to get used to,
but once I started thinking of it as being similar to
mathematical functional notation, it actually made a lot of sense and led to a
very satisfying degree of consistency in syntax that is one of Clojure's (and
Lisp's) main
attractions. </p>
<p>Here are some interesting Clojure-related links:</p>
<p><a href="http://clojure.org/api/cheatsheet">Clojure Cheatsheet</a>: Quickest way to
navigate Clojure documentation and look up any function in the core.</p>
<p><a href="https://github.com/stuarthalloway/clojure-bowling">clojure-bowling</a>:
Implementation of the bowling game kata in Clojure, using lazy sequences. Very
slick and very true to Clojure idiom.</p>
<p><a href="https://github.com/bbatsov/clojure-style-guide">Clojure Style Guide</a>: Style
guide for Clojure code, which I still haven't fully digested...</p>
<p><a href="https://stuartsierra.com/2016/clojure-how-to-ns.html">How to ns</a>: A whole
article on specifically how to style namespace declarations, by Stuart Sierra,
the same author for the above.</p>
<p><a href="https://8thlight.com/blog/eric-smith/2016/10/05/a-testable-clojurescript-setup.html">TDD in
ClojureScript</a>:
I haven't really done anything in ClojureScript yet, but my mentor wrote this
blog post about how to set up a good TDD environment for a ClojureScript
project.</p>
<p><a href="http://shaunlebron.github.io/t3tr0s-slides/">Tetris in ClojureScript</a>: Speaking
of ClojureScript, a really cool implementation of a browser-based Tetris game.
The slides include the link to the Github repo with all the code.</p>
<p><a href="https://archive.org/details/SICP_4_ipod">SICP lecture videos</a>: Not
strictly Clojure, but since you can't
mention Lisp without thinking of <em>Structure and Interpretation of
Computer Programs</em>... As an alternative or supplement to working your way
through the text, here are uploaded lecture videos of the MIT course on which
the book was based. Hat tip to the ChiPy mailing list, where I originally got
the link.</p>
<p>...And not really to do with Clojure at all, but:</p>
<p><a href="https://sanctum.geek.nz/arabesque/vim-anti-patterns/">Vim anti-patterns</a>: I've
taken the tips in this article to heart as I continue in my quest to improve my
Vim usage.<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup></p>
<p><a href="http://tomaugspurger.github.io/modern-1.html">Modern pandas</a>: Better ways
to use <code>pandas</code>. I have yet to fully delve into this post, but I listened to a
talk last month on <code>pandas</code> best practices that was really mindblowing, and the
speaker attributed most of his insights to this blog post by one of the main
contributors to the package.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Permit me to take a moment here to rant about the term "anti-pattern". It
is really popular in tech/software circles, to the point where it's been
applied to other contexts. E.g. "meeting anti-patterns" or "office layout
anti-patterns". Let me just point out that when first faced with the term, a
reasonably educated layperson would conclude that it referred to the
<em>opposite</em> of a pattern: something that is disorganized or random. But the
actual meaning is a <em>negative</em> pattern that impedes functionality or
productivity. There's really no good way to arrive at this definition unless
you have the assumption that patterns are inherently positive, which I
guess makes sense to programmers who usually associate the word "pattern" with
"design pattern". But to
the rest of the world, patterns are value-neutral. So it makes very little
sense to apply the term to anything that isn't related to software design.
Arguably, if the reasonably educated layperson can't deduce the correct
meaning from breaking down the word, it qualifies as jargon and doesn't belong
in effective technical writing. (Latter rule courtesy of a professor in grad
school, who memorably told us to stop using "Western blotting" and say
"immunoblotting" instead.) I doubt anyone will agree with me or care,
but what other purpose do blogs serve than to provide an outlet for one's
pedantic pet peeves... <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Sub- and superclasses2016-10-06T00:00:00-05:002016-10-06T00:00:00-05:00Hana Leetag:hanalee.info,2016-10-06:/blog/sub-and-superclasses.html<p>Objects are not nouns but collections of verbs</p><p>We just listened to a zagaku talk on "moving on from inheritance", which in a
lot of ways boiled down to the <a href="https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)">SOLID
principles</a> and
<a href="https://en.wikipedia.org/wiki/Composition_over_inheritance">"composition over
inheritance"</a>.</p>
<p>(8th Light dictionary: zagaku is a quick talk given by one of the crafters to
the apprentices. In Japanese, it means "seated learning" (座学); the Korean reading of those characters
would be 좌학. We have one every day from Monday through Thursday, on all sorts
of topics, usually but not always related to software or consulting.)</p>
<p>Anyway, I realized during the talk, when a diagram was put up to
illustrate how inheritance is used to extend classes, that we say subclass and
superclass and expect these terms in some sense to operate like subsets and
supersets. (Well, maybe you don't if you are an experienced programmer, but I
initially did when first encountering the concept of object-oriented
programming.) </p>
<p>But when you use inheritance to extend a class, the resulting visual
representation becomes counterintuitive. The subclass has added functionality
that does not belong to the superclass. So instead of drawing the subclass
enclosed within the superclass, the subclass contains the superclass instead.
That's essentially what the Liskov substitution principle (LSP) describes: a
subclass must be able to function in whatever context its superclass is called. </p>
<p>I think there's an additional misleading layer where introductory books use the
heuristic "IS-A" to teach how inheritance works. "Dog IS-A
Canine and Wolf IS-A Canine" so both Dog and Wolf should be subclasses of
Canine. But <a href="https://en.wikipedia.org/wiki/Circle-ellipse_problem">"Circle IS-A(N)
Ellipse"</a>, and that's one of the classic examples of
LSP violation. The inheritance relationships between objects should not be
determined by categories but by functionality.</p>
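<p>The circle-ellipse problem is easy to make concrete. In this toy Python
sketch (my own illustration, not from the linked article), client code written
against Ellipse breaks when handed a Circle, even though a circle "IS-AN"
ellipse mathematically:</p>

```python
class Ellipse:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def stretch_width(self, new_width):
        self.width = new_width  # callers expect height to stay fixed


class Circle(Ellipse):
    def __init__(self, diameter):
        super().__init__(diameter, diameter)

    def stretch_width(self, new_width):
        # A circle must stay round, so it resizes both dimensions,
        # silently breaking the expectation callers have of Ellipse.
        self.width = new_width
        self.height = new_width


def halve_width(ellipse):
    # Client code relying on Ellipse's contract.
    original_height = ellipse.height
    ellipse.stretch_width(ellipse.width / 2)
    assert ellipse.height == original_height  # fails for Circle


halve_width(Ellipse(4, 2))  # fine
# halve_width(Circle(4))    # AssertionError: the subclass broke the contract
```

<p>The category "circle is an ellipse" holds, but the <em>behavior</em> does not,
which is exactly why functionality rather than taxonomy should drive inheritance.</p>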
<p>So sets are not really a good metaphor for objects at all. Objects are not
really nouns; they are collections of verbs.</p>Notes on Refactoring2016-10-05T00:00:00-05:002016-10-05T00:00:00-05:00Hana Leetag:hanalee.info,2016-10-05:/blog/notes-on-refactoring.html<p>Notes on <em>Refactoring</em> by Martin Fowler</p><p>I had another reading assignment before this one, but there was so much material
that I have three(!) unfinished blog drafts on it. <em>Refactoring</em> is an even
longer book, but most of the material in it is intended as reference, so it's
quicker to write up my notes.</p>
<h3>Defining refactoring</h3>
<blockquote>
<p>the process of changing a software system in such a way that it does not alter
the external behavior of the code yet improves its internal structure</p>
</blockquote>
<ul>
<li>
<p>Also can be thought of as software decay in reverse.</p>
</li>
<li>
<p>Make code easier to understand.</p>
</li>
<li>
<p>Make code easier to modify.</p>
</li>
<li>
<p>Allows you to develop faster because it becomes easier to add new features.</p>
</li>
<li>
<p>Reduces the penalty of making changes to the overall design, which in turn
means less time needed for upfront design. Also encourages simpler design even
if it is less flexible, since it means design can always be changed. </p>
</li>
</ul>
<h3>General recommendations</h3>
<ul>
<li>
<p>Refactor before you add a new feature. Adding a new feature should not involve
changing existing code (if you have correctly implemented the open-closed
principle, or OCP).</p>
</li>
<li>
<p>Have self-checking automated tests so you can refactor safely.</p>
<ul>
<li>Run tests frequently after each step in a refactoring. If you forget, go
back and redo the refactoring with testing.</li>
<li>Test boundary conditions and expected exceptions.</li>
</ul>
</li>
<li>
<p>Some design principles to keep in mind:</p>
<ul>
<li>A method belongs with the object whose data it uses.</li>
<li>Minimize use of temporary variables and complex conditional logic.</li>
<li>Eliminate duplicate code since more lines of code are more difficult to
maintain.</li>
<li>Clean code should not require you to remember anything about it to be
readable and comprehensible.</li>
</ul>
</li>
<li>
<p>Refactoring can be used to understand unfamiliar code or conduct a code
review or understand a bug.</p>
</li>
<li>
<p>Don't set aside time to refactor. Refactoring should be done all throughout
the development process in frequent, short bursts.</p>
</li>
<li>
<p>Sometimes refactoring can be as simple as renaming variables. Don't hesitate
to rename.</p>
</li>
<li>
<p><strong>Rule of Three</strong>: Refactor the third time you write code that duplicates
similar behavior.</p>
</li>
</ul>
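<p>As a quick illustration of the kind of duplication these recommendations
target (the example is mine, not Fowler's), the Extract Method refactoring pulls
repeated logic into one named function that can be tested in isolation:</p>

```python
# Before: two functions duplicate the "total, then average" arithmetic.
def report_scores(scores):
    total = sum(scores)
    return f"scores: total {total}, average {total / len(scores):.1f}"


def report_weights(weights):
    total = sum(weights)
    return f"weights: total {total}, average {total / len(weights):.1f}"


# After: the duplicated behavior is extracted into one function with a
# descriptive name, and the callers shrink to a single shared entry point.
def summarize(values):
    """Return (total, average) for a non-empty sequence of numbers."""
    total = sum(values)
    return total, total / len(values)


def report(label, values):
    total, average = summarize(values)
    return f"{label}: total {total}, average {average:.1f}"
```

<p>By the Rule of Three, a third report function duplicating the same arithmetic
would be the signal to do this extraction if it hadn't been done already.</p>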
<h3>Indirection</h3>
<ul>
<li>
<p>Software design concept that boils down to having many small modular
components.</p>
</li>
<li>
<p>While it creates more pieces of code to manage, the benefits outweigh the
drawbacks:</p>
<ul>
<li>Easy to share logic between different parts of your code.</li>
<li>Allows (through good naming practices) explanation of the intent behind
each step in your code.</li>
<li>When making modifications, keeps change isolated to one part of the system.</li>
<li>Simplify conditional logic (e.g. in the object-oriented paradigm, you can use
the identity of the object rather than branching to specify different
behaviors).</li>
</ul>
</li>
</ul>
<h3>Limitations of refactoring</h3>
<ul>
<li>
<p>Hard to refactor applications with tight coupling to databases</p>
</li>
<li>
<p>May involve changing published interfaces, which requires maintaining old and
new versions</p>
</li>
<li>
<p>Sometimes it is better to rewrite from scratch.</p>
</li>
<li>
<p>When <em>not</em> to refactor:</p>
<ul>
<li>Too many failing tests! Only refactor code that works.</li>
<li>Close to deadline</li>
</ul>
</li>
</ul>
<h3>Performance optimization</h3>
<ul>
<li>
<p>Optimization often makes code harder to read and understand. But refactoring
can make it easier to tune the performance.</p>
</li>
<li>
<p>Most of the time, there is a rate-limiting step that is responsible for
slowing down the program. Don't optimize all parts of the code equally;
identify where this step is (e.g. big-O analysis). </p>
</li>
</ul>
<h3>Code smells</h3>
<p>(There are probably dozens of blog posts that have already regurgitated this
material partially or in full, so I'm just going to include some mnemonics to
help myself remember what each one is about.)</p>
<p><strong>Duplicated code</strong>: Self-explanatory</p>
<p><strong>Long method</strong>: Some heuristics for identification</p>
<ul>
<li>"When we feel a need to comment something, we write a method instead."</li>
<li>A lot of parameters or temporary variables</li>
<li>Conditional logic</li>
<li>Loops</li>
</ul>
<p><strong>Large class</strong>: One sign is too many instance variables</p>
<p><strong>Long parameter list</strong>: Note that this "smell" may end up being necessary in
some cases to avoid unwanted dependencies.</p>
<p><strong>Divergent change</strong>: Basically a single-responsibility principle (SRP)
violation.</p>
<p><strong>Shotgun surgery</strong>: The inverse of divergent change. Could be sort of seen as
a dependency-inversion principle (DIP) violation in some cases?</p>
<p><strong>Feature envy</strong>: Method shouldn't use data that doesn't belong to its object!
(Except for some design patterns like Strategy or Visitor. So think about
trade-offs.)</p>
<p><strong>Data clumps</strong>: Data that tend to be used together belong in their own
object.</p>
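<p>To make the "data clumps" smell concrete, here is a minimal Python sketch of my own (the <code>Point</code> example is a standard illustration, not from the catalog itself):</p>

```python
from dataclasses import dataclass

# Before: the same clump of arguments travels together everywhere.
def distance_before(x1, y1, x2, y2):
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5

# After: the clump gets its own object, with behavior attached to it.
@dataclass(frozen=True)
class Point:
    x: float
    y: float

    def distance_to(self, other):
        return ((other.x - self.x) ** 2 + (other.y - self.y) ** 2) ** 0.5

print(distance_before(0, 0, 3, 4))           # 5.0
print(Point(0, 0).distance_to(Point(3, 4)))  # 5.0
```

<p>Once the clump is an object, methods like <code>distance_to</code> have a natural home, and every signature that used to take four coordinates now takes two points.</p>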
<p><strong>Primitive obsession</strong>: Use objects rather than primitives (e.g. Java's
String).</p>
<p><strong>Switch statements</strong>: Use polymorphism rather than switch-case branching.</p>
<p><strong>Parallel inheritance</strong>: Duplication of class hierarchies.</p>
<p><strong>Lazy class</strong>: Is this class doing anything?</p>
<p><strong>Speculative generality</strong>: Functionality that isn't being used yet.</p>
<p><strong>Temporary field</strong>: Instance variable that is only set in certain
circumstances and sits empty the rest of the time.</p>
<p><strong>Message chains</strong>: Chains of calls between objects.</p>
<p><strong>Middleman</strong>: Heuristic is that if an object is delegating half its methods to
the same object, those two objects should communicate directly.</p>
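<p>The message-chain and middleman smells are two sides of the same delegation trade-off. A minimal Python sketch of hiding a delegate behind the immediate object (names invented for illustration):</p>

```python
# Message chain: the caller navigates the object graph itself.
class Department:
    def __init__(self, manager):
        self.manager = manager

class Person:
    def __init__(self, department):
        self.department = department

alice = Person(Department(manager="Bob"))
# Smell: alice.department.manager couples the caller to the
# internal structure of Person.
print(alice.department.manager)  # Bob

# Fix: hide the delegate behind a method on the immediate object,
# so callers only talk to Person.
class PersonFixed(Person):
    @property
    def manager(self):
        return self.department.manager

carol = PersonFixed(Department(manager="Bob"))
print(carol.manager)  # Bob
```

<p>Overdoing this in the other direction, a class whose methods all just forward to one delegate, is exactly the middleman smell above.</p>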
<p><strong>Inappropriate intimacy</strong>: Boils down to lack of encapsulation and tightly
coupled classes.</p>
<p><strong>Alternative classes with different interfaces</strong>: Self-explanatory</p>
<p><strong>Incomplete library class</strong>: Situation where you need to make a slight
modification to a library.</p>
<p><strong>Data class</strong>: Classes that do nothing but hold data that is manipulated by
other classes.</p>
<p><strong>Refused bequest</strong>: Not always a problem. But one example of when it is: a
subclass that doesn't support the interface of its superclass.</p>
<p><strong>Comments</strong>: Not inherently bad, but they usually pop up to mask a lack of
readability or comprehensibility. Address the underlying problem directly and
the need for the comment usually disappears.</p>
<h3>Refactoring tools</h3>
<p>I wonder if a refactoring tool rather resembles a compiler, since it needs to
understand code syntax in much the same way.</p>
<ul>
<li>
<p>Help integrate refactoring into regular development process by reducing the
cost of refactoring.</p>
</li>
<li>
<p>Allows for safe refactoring without needing to rerun tests.</p>
</li>
<li>
<p>Uses parse trees to represent the internal structure of a method.</p>
</li>
<li>
<p>Needs to be safe and accurate.</p>
</li>
</ul>Links, the design principles edition2016-09-15T00:00:00-05:002016-09-15T00:00:00-05:00Hana Leetag:hanalee.info,2016-09-15:/blog/links-the-design-principles-edition.html<p>Useful links related to software design principles</p><p>I was gently reminded today that I haven't kept up with writing blog posts,
despite the original goal of posting three times a week. Oops! Currently,
I'm reading <em>Agile Software Development: Principles, Patterns, and
Practices</em>, which has been rather like drinking from a firehose of
concepts and code examples. I've moved on from working on a Java implementation
of tic-tac-toe to building an HTTP server in Clojure, which has propelled me to
finally figure out what GET and POST requests are all about.</p>
<p>I tend to collect links in my private DM channel on the <a href="http://8thlight.com">8th Light</a> Slack, so I'll start off this
return to (hopefully regular) blogging with another list of useful reading:</p>
<p><a href="http://agileinaflash.blogspot.com/2012/06/simplify-design-with-zero-one-many.html">Simplify Design with Zero, One, Many</a>: I've found "zero, one, many" to be a really useful guideline to turn to when I'm trying to figure out which unit test to write next or when I'm checking for test coverage. Any loop in your code should be covered by a test for the zero case, the singleton case, and the "many" case (two items usually suffice for "many").</p>
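<p>In Python, the guideline boils down to one test per case for any loop (a sketch of my own, not from the linked post):</p>

```python
def total(prices):
    """Sum a list of prices -- the kind of loop the guideline targets."""
    result = 0
    for p in prices:
        result += p
    return result

# Zero, one, many: one test per case is usually enough coverage
# for a loop, and two items suffice for "many".
assert total([]) == 0        # zero
assert total([5]) == 5       # one
assert total([5, 7]) == 12   # many
print("all three cases pass")
```
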
<p><a href="http://arlobelshee.com/good-naming-is-a-process-not-a-single-step/">Good naming is a process, not a single
step</a>:
A good series of posts on naming variables in your code. I find that as I
improve my sense of design, finding names that fully and accurately describe
a component's "single responsibility" is crucial to helping me refactor my code
and make it more comprehensible.</p>
<p><a href="http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd">The Three Rules of
TDD</a>:
A classic Uncle Bob post. A slightly different formulation of the TDD cycle
that I find helpful because of two points in particular.</p>
<ol>
<li>
<p>Treating compiler failures as failing tests, since I run into them frequently.</p>
</li>
<li>
<p>Only writing enough production code to make a test pass, since one of the
recurring themes in code reviews from my mentors is that I tend to come up
with something more complicated than what the test (and the actual problem
being solved) calls for.</p>
</li>
</ol>
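<p>Rule 2 in particular can feel strange at first, so here is a tiny sketch of what "only enough production code to pass" looks like (the FizzBuzz-style example is my own, not from the post):</p>

```python
# Step 1: the failing test comes first. If fizz() didn't exist yet,
# calling this test would raise a NameError -- the compile-failure-
# as-failing-test idea from point 1 above.
def test_three_is_fizz():
    assert fizz(3) == "Fizz"

# Step 2: write only enough production code to make it pass.
# Even a hard-coded return is legitimate at this stage.
def fizz(n):
    return "Fizz"

test_three_is_fizz()
# A second test, e.g. asserting fizz(1) == "1", is what forces a
# real implementation -- never write more than the tests demand.
print("test passes")
```
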
<p><a href="http://martinfowler.com/bliki/BeckDesignRules.html">Beck Design
Rules</a>: Another take on
how to avoid complicated code: a list of priorities that fit in very well with
the TDD cycle. First, you write code that meets the first priority, passing
the tests. Then, you refactor to remove duplication and express the intent
of your code. (The post talks very usefully about how sometimes there
might be tension between those two priorities and how to find the right
balance.) Finally, the last priority, fewest elements, forces you to
remove anything redundant or not being used (another "code smell" that is
prone to popping up in my code).</p>
<p><a href="https://github.com/sf105/goos-code">Code for the auction sniper application in
GOOS</a>: <a href="goos_notes.html"><em>Growing
Object-Oriented Software</em></a> spends the bulk of its volume on an example of an auction sniper software, which is worked through in
detail to illustrate the design decisions and processes that the authors have
laid out. This Github repository contains all the code, and I rather wish I had
known about it when reading the book! A lot of the ideas that the book talks
about only really became clear to me after I had mostly finished my
tic-tac-toe in Java, to the effect that I would probably go about designing a
tic-tac-toe very differently if I had to start over from scratch now. I'm
trying to keep them in mind as I write my Clojure server, though
there is a bit of a translation process
due to Clojure being a functional language rather than an object-oriented one.</p>Data visualization and graphics in Python2016-08-18T00:00:00-05:002016-08-18T00:00:00-05:00Hana Leetag:hanalee.info,2016-08-18:/talks/data-visualization-and-graphics-in-python.html<p>Or, where is my <code>ggplot2</code>?!</p><p>The following talk was presented at the Chicago Python User Group (ChiPy)'s
Scientific Special Interest Group (SIG) meeting on August 17, 2016.</p>
<p><strong>Slides:</strong> <a href="http://hnlee.github.io/talks/pyviz/">http://hnlee.github.io/talks/pyviz/</a></p>
<p><strong>Notebook:</strong> <a href="https://github.com/hnlee/talks/blob/master/pyviz/pyviz.ipynb">https://github.com/hnlee/talks/blob/master/pyviz/pyviz.ipynb</a> </p>Notes on data visualization and graphics in Python2016-08-18T00:00:00-05:002016-08-18T00:00:00-05:00Hana Leetag:hanalee.info,2016-08-18:/blog/notes-on-data-visualization-and-graphics-in-python.html<p>Or, where is my <code>ggplot2</code>?!</p><p>Yesterday, I gave a talk at ChiPy's Scientific SIG meeting, providing an
overview of the data visualization packages in Python, from the perspective of
a former scientist who switched from using R to Python for large-scale data
analysis. My talk was framed around the current limitations in Python's data
visualization packages and what improvements I would like to see. Here is a written, and considerably more polished, version of what I spoke
about.</p>
<h3>What we talk about when we talk about data visualization</h3>
<p>It's hard to talk about "Big Data" because the data are complicated. What
does that mean? There are many variables, both continuous and discrete,
exhibiting patterns and trends that cannot be easily modeled linearly, and the
data are frequently riddled with noise. Such data do not yield
easy takeaways that can be summarized in a single phrase. Thus, it becomes
more important than ever that we have tools to not only analyze our data
correctly but also to communicate our findings to others.</p>
<p>That's the importance of data visualization. Edward Tufte, who is one of the
pioneers in this field, sums up the criteria for an
effective graphic in this quote from his classic, <em>The Visual Display of
Quantitative Information</em>: </p>
<blockquote>
<p>Graphical excellence is that which gives to the viewer the greatest number of
ideas in the shortest time with the least ink in the smallest space.</p>
</blockquote>
<p>It's important that we convey the overall message as well as key details in a
format that is clear, simple, and easy to interpret.</p>
<p>We can start to develop some guidelines for creating such "excellent" data
graphics by looking at how humans process visual information. How do
we distinguish between different quantities? There's a maxim that pie charts are
much worse than bar charts for displaying categories with similar amounts
because our eye is better able to notice small differences in length than area
or angle. Another question to ask is how we notice relationships among
individual data points. Graphical elements that are located close together,
that share similar properties like color or shape, or that are directly
connected or enclosed by a common boundary all help the viewer
draw connections between data components.</p>
<p>Broader considerations include thinking about directing the viewer's attention:
where does their eye fall and are there indicators that show where to look next?
Possibly most important of all, every data graphic has a story to tell, and
the success of any data visualization depends on having a coherent narrative. "A picture is worth a thousand words," but we need to know
what those words are in the first place.</p>
<h3>ggplot2 and a grammar of graphics</h3>
<p>Another pioneer in the field, Leland Wilkinson, tried to
systematize how we go about creating data
visualizations through what he called a "grammar of graphics". It breaks
down the components of a plot into the following abstract elements:</p>
<ul>
<li>Aesthetics, which visually represent data variables</li>
<li>Geometric objects, which depict data points</li>
<li>Scales and coordinates, which communicate information about quantities</li>
<li>Statistics, which are transformations applied to the data to illustrate analysis</li>
<li>Facets, which are how multiple related plots can be tied together in a graphic</li>
<li>Annotations, which are text labels applied to a plot</li>
</ul>
<p>These elements form the foundation for the R library, <code>ggplot2</code>,
which has set a high standard for data visualization packages and is
widely used in the R community. Its particular strengths are a clear and
consistent syntax, based on the grammar of graphics; a layering system,
which allows you to quickly add new elements to a plot; and a faceting system,
which makes it easy to generate data graphics with multiple related plots, when
dealing with complicated data.</p>
<p>Of particular note is how simple it is to apply statistical functions, like
calculating the mean and standard error to summarize a lot of data points, and
useful, attractive plotting methods, like <code>geom_smooth()</code>, which automatically
applies a smoothing loess function to your data and draws a line on top of a
translucent ribbon that represents the 95% confidence interval.</p>
<p>It also makes it easy to go from a single plot to multiple plots split across
an additional variable. In my code examples, I show how I go from a single plot
depicting mean car mileages measured during city and highway driving over time to multiple plots depicting how this
relationship changes with engine displacement.<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> The final version
illustrates the trend among four different variables (miles per gallon, year,
type of mileage measured, and engine displacement), in an uncluttered
two-dimensional graphic that is easy to interpret. </p>
<p>So if this R library is so awesome, why would we ever want to use Python
instead? Well, Python has quite a few advantages over R for data science,
particularly when we are dealing with large amounts of data. It is overall
faster and more efficient than R, and because it is a general purpose
programming language, it is easier to integrate into applications that
may want to use the results of your data analysis. It has highly readable
syntax and testing frameworks that make it easier to write robust code. In an
industry setting, Python is much better suited to deploying and releasing
data-based products, which is why it is more widely used.</p>
<h3>What's available in Python</h3>
<p>When it comes to data visualization packages for Python, there are three main
options: <code>matplotlib</code>, <code>seaborn</code>, and <code>ggplot</code>. All of these fall short of R's
<code>ggplot2</code> and have certain limitations that stop them from being useful for
generating excellent data graphics. I'll start with <code>matplotlib</code>, which
is probably the most widely used library.</p>
<p><code>matplotlib</code> began as a port of Matlab's plotting interface. I often joke that I haven't met
a single person who enjoys coding in Matlab. Unfortunately, that also applies
to <code>matplotlib</code>. A lot of people coming to Python are already familiar
with this library from Matlab, and it does have powerful methods for
rendering 3D or interactive plots. But <code>matplotlib</code> has several pitfalls.
Its syntax does not take any advantage of Python's clarity and is difficult to
read. Without customization, it generates some truly ugly plots...and it
doesn't make that customization easy to do either. (The plots shown in the
slides are actually much better than what the default <code>matplotlib</code> style
looks like, because Jupyter notebooks automatically apply a style that is meant
to resemble the default theme for <code>ggplot2</code>.) It's certainly possible to
"prettify" <code>matplotlib</code> if you dig down into its functions to control the
appearance of your plots, but it doesn't make it easy. Finally, <code>matplotlib</code>
doesn't have any functions that actually deal with subsetting or
transforming your data. For example, you have to manually write code to
filter your data by categories if you want to show data points in multiple
colors on a single plot. The lack of such higher-level functions makes
using <code>matplotlib</code> unsatisfactory for data visualization despite all its
power as a graphics package. </p>
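<p>A stdlib-only sketch of the manual filtering described above; the data here are invented, and the actual <code>scatter</code> calls are left as comments since they require a matplotlib <code>Axes</code>:</p>

```python
from collections import defaultdict

# Sample (x, y, category) records, standing in for a real data set.
points = [(1, 2.0, "city"), (2, 2.5, "city"),
          (1, 3.0, "highway"), (2, 3.8, "highway")]

# matplotlib has no built-in "color by column", so you filter the
# data yourself, building one subset per category...
by_category = defaultdict(list)
for x, y, category in points:
    by_category[category].append((x, y))

# ...and then plot each subset separately, along the lines of:
#   for category, pts in by_category.items():
#       xs, ys = zip(*pts)
#       ax.scatter(xs, ys, label=category)
print(sorted(by_category))  # ['city', 'highway']
```

<p>In <code>ggplot2</code>, by contrast, the equivalent is a single aesthetic mapping like <code>colour = category</code>; the grouping happens inside the library.</p>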
<p><code>seaborn</code> is a wrapper for <code>matplotlib</code> that simplifies its syntax and
generates much more attractive looking graphs. It also has some data handling
functions similar to <code>ggplot2</code> that facilitate the process of making more
complex plots. However, there are some obvious gaps in its functionality,
which are also reflected in its incomplete
documentation. In particular, what I've noticed is that it has a limited range
of plot types, each of which is fairly inflexible in how it will handle data.
In my code examples, I show how the <code>stripplot()</code> and <code>factorplot()</code>
functions treat the variable for year as discrete rather than continuous
data, which means that labels along the x-axis end up unreadable as
every value for year is treated as a separate category. There's also no
support for more specialized plots, like contour plots. The methods available
for customizing text labels and annotations are limited and have inconsistent
syntax from plot type to plot type. In some cases, it requires drilling
down into the underlying <code>matplotlib</code> code to get the effect you want. So while
there are many promising aspects to <code>seaborn</code>, it seems that it needs further
development to be truly mature as a data visualization tool.</p>
<p>The last option, which may be the least known or used out of the three, is
<code>ggplot</code>, a Python port of <code>ggplot2</code> that is being developed by Ŷhat. Now
at first, this package sounds like the answer we've been wanting: a way to
use <code>ggplot2</code>'s powerful approach to data visualization in a Python environment.
However, it is also an incomplete package still under development, which
becomes clear when you look at its
<a href="http://ggplot.yhathq.com/docs/index.html">documentation</a>, where many pages
are completely empty. What's great is that it emulates <code>ggplot2</code>'s syntax,
breaking down a plot into abstract elements that helps you think about how to
visually represent your data in an effective way. What's not so great is that
it's missing a lot of key functions for statistical transformations and
plot types. In the code examples, when I attempt to use the <code>geom_jitter()</code>
method to prevent overplotting, there is no actual jittering on the points. The
implementation of <code>geom_smooth()</code> also seems to have bugs in how it calculates
the 95% confidence interval and how it applies loess smoothing to the
data. And faceting just creates a visually unappealing, proportionally
unbalanced set of plots. In short, there needs to be more development of this package
in order for it to be fully usable.</p>
<h3>Where do we go from here?</h3>
<p>When compared to the kind of data graphics we can produce in R with <code>ggplot2</code>,
the status quo in Python is far from satisfactory...but we don't have to be demoralized.
I think we can look at the situation as an opportunity: to create a Python
data visualization package that can rival <code>ggplot2</code> in its power and ease of
use. <a href="https://github.com/matplotlib/matplotlib"><code>matplotlib</code></a>,
<a href="https://github.com/mwaskom/seaborn"><code>seaborn</code></a> and
<a href="https://github.com/yhat/ggplot"><code>ggplot</code></a> are all open-source projects to
which Python programmers and data scientists can contribute. Or perhaps the
Python community can build a new package from the ground up that has
<code>ggplot2</code>'s functionality.</p>
<p>I would like to see a data visualization tool that implements what is
key to <code>ggplot2</code>'s success in the R community. People tend to try to imitate
the look and style of <code>ggplot2</code> graphics and stop there. But the core innovation of
<code>ggplot2</code> is how it utilizes the "grammar of graphics" concepts to
organize how we construct plots. There's no reason that we can't create that
in Python.</p>
<p>In the meantime, we also have the option of using R and Python together,
which is particularly easy to do in a Jupyter notebook environment with
the <code>rpy2</code> package. (Or if you want to run Python from R, there is the
<code>rPython</code> library on CRAN.)</p>
<p>As a footnote, while I've focused on static data graphics here, there are
also tools available for constructing interactive plots in both languages,
<code>ggvis</code> for R and <code>bokeh</code> for Python, which I encourage people to check out
as well. </p>
<p>For code examples and pretty plots, please look at my <a href="https://github.com/hnlee/talks/blob/master/pyviz/pyviz.ipynb">Jupyter
notebook</a> or my
<a href="http://hnlee.github.io/talks/pyviz/">slides</a>.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>The data for all the plots shown in my slides comes from an
open-source data set provided by
the <a href="http://fueleconomy.gov">EPA</a>, processed and made available as an R
package, <a href="https://github.com/hadley/fueleconomy"><code>fueleconomy</code></a>. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Growing Object-Oriented Software, Guided by Tests2016-08-11T00:00:00-05:002016-08-11T00:00:00-05:00Hana Leetag:hanalee.info,2016-08-11:/blog/growing-object-oriented-software-guided-by-tests.html<p>Notes on <em>Growing Object-Oriented Software, Guided by Tests</em> by Steve Freeman and Nat Pryce</p><p>This book focuses on how to do test-driven development with object-oriented
software, but it also ended up introducing to me several general principles
about software design and defining vocabulary that are routinely referenced in
articles and blog posts about software. In other words, it is dense and provides
a lot to digest.</p>
<p>The first third of the book lays out the concepts and then most of the last
two-thirds work through an example of building an auction sniper application.</p>
<h3>Definitions</h3>
<ul>
<li>
<p><strong>acceptance test</strong>: tests the functionality of a feature and how the whole
system operates</p>
</li>
<li>
<p><strong>integration test</strong>: tests how code interacts with external/invariant code</p>
</li>
<li>
<p><strong>unit test</strong>: tests the functionality of a single object</p>
</li>
<li>
<p><strong>end to end test</strong>: tests the system as if it were a black box and only
interacts with it through the UI</p>
</li>
<li>
<p><strong>edge to edge test</strong>: tests every step from build to deployment to release</p>
</li>
<li>
<p><strong>coupling</strong>: change in one component forces change in another (i.e. the
modularity, or lack thereof, of the system)</p>
</li>
<li>
<p><strong>cohesion</strong>: responsibilities form a meaningful unit</p>
</li>
<li>
<p><strong>role</strong>: related group of responsibilities</p>
</li>
<li>
<p><strong>responsibility</strong>: obligation to either perform a task or know information</p>
</li>
<li>
<p><strong>collaboration</strong>: interaction between objects or roles</p>
</li>
<li>
<p><strong>mockery</strong>: jMock term that refers to the context of the object being tested</p>
</li>
<li>
<p><strong>mock object</strong>: a test object that substitutes for objects that interact with
the object under test</p>
</li>
<li>
<p><strong>expectations</strong>: rules that define how mock objects should be invoked</p>
</li>
<li>
<p><strong>"walking skeleton"</strong>: most minimal implementation necessary to have an
end-to-end test</p>
</li>
<li>
<p><strong>encapsulation</strong>: behavior of an object can only be affected through
methods for interaction with other objects</p>
</li>
<li>
<p><strong>information hiding</strong>: how an object functions remains internal and invisible to
other objects</p>
</li>
<li>
<p><strong>aliasing</strong>: sharing references to mutable objects, breaks encapsulation</p>
</li>
<li>
<p><strong>peers</strong>: objects with which a given object communicates </p>
</li>
<li>
<p><strong>dependencies</strong>: services from peers without which the object cannot
function</p>
</li>
<li>
<p><strong>notifications</strong>: peers that need to be updated with the object's
behavior or status</p>
</li>
<li>
<p><strong>adjustments</strong>: peers that adjust the object's behavior to work with
the rest of the system</p>
</li>
<li>
<p><strong>context independence</strong>: object has no internal knowledge about its
environment</p>
</li>
<li>
<p><strong>interface</strong>: "whether two components will fit together"</p>
</li>
<li>
<p><strong>protocol</strong>: "whether two components will work together"</p>
</li>
<li>
<p><strong>spike</strong>: initial code written to figure out what to do, later rolled back
and rewritten more cleanly</p>
</li>
<li>
<p><strong>implementation layer</strong>: describes how the code will do something</p>
</li>
<li>
<p><strong>declarative layer</strong>: describes what the code will do</p>
</li>
</ul>
<h3>Key Ideas</h3>
<ul>
<li>
<p>In general, we want <em>low</em> coupling and <em>high</em> cohesion.</p>
</li>
<li>
<p>Object-oriented design can be thought of as the network of communications
among the objects in your software system.</p>
</li>
<li>
<p>Objects are mutable; values are immutable. </p>
</li>
<li>
<p>Interfaces help define an object's roles.</p>
</li>
<li>
<p>"Tell, Don't Ask" or <a href="https://en.wikipedia.org/wiki/Law_of_Demeter">"Law of
Demeter"</a></p>
</li>
<li>
<p>Mock objects are used to test interactions between objects.</p>
</li>
<li>
<p>Begin with a "walking skeleton".</p>
</li>
<li>
<p>Start each new feature with an acceptance test to determine how the new feature
will function.</p>
</li>
<li>
<p>Separate acceptance tests for completed features to catch bugs vs acceptance
tests for new features in progress.</p>
</li>
<li>
<p>Write unit tests for object behavior rather than the object's methods.</p>
</li>
<li>
<p>Unit tests check the internal quality of the code; acceptance tests check the
external quality.</p>
</li>
<li>
<p>Something that is difficult to test is probably badly designed.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Single_responsibility_principle">Single Responsibility
Principle</a>:</p>
</li>
</ul>
<blockquote>
<p>Our heuristic is that we should be able to describe what an object does
without using any conjunctions ("and," "or").</p>
</blockquote>
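<p>The mock-object idea above, testing the messages an object sends to its peers rather than any state, can be sketched with Python's standard <code>unittest.mock</code> (the <code>Sniper</code>/notifier names are my own nod to the book's auction sniper, not its actual Java code):</p>

```python
from unittest.mock import Mock

# Object under test: when the auction closes, it should tell its
# notifier peer that it lost.
class Sniper:
    def __init__(self, notifier):
        self.notifier = notifier

    def auction_closed(self):
        self.notifier.sniper_lost()

# The mock stands in for the peer; the unit test then asserts on
# the *interaction* (the message sent), not on any return value.
notifier = Mock()
Sniper(notifier).auction_closed()
notifier.sniper_lost.assert_called_once_with()
print("interaction verified")
```
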
<ul>
<li>
<p>Interacting with the composite object should be simpler than interacting with
the components that compose it.</p>
</li>
<li>
<p>"Mock an object's peers [...] but not its internals."</p>
</li>
<li>
<p>Techniques for introducing new objects:</p>
<ul>
<li>"Breaking out": when code for an object becomes too complex, separate it into smaller
units </li>
<li>"Budding off": placeholder for a new object, to be filled in with more
implementation details later</li>
<li>"Bundling up": creating a new object for a group of objects that are always
used together</li>
</ul>
</li>
<li>
<p>When to break out:</p>
</li>
</ul>
<blockquote>
<p>Break up an object if it becomes too large to test easily, or if its test
failures become difficult to interpret. Then unit-test the new parts
separately.</p>
</blockquote>
<ul>
<li>When to bud off:</li>
</ul>
<blockquote>
<p>When writing a test, we ask ourselves, "If this worked, who would know?" If
the right answer to that question is not in the target object, it's probably
time to introduce a new collaborator.</p>
</blockquote>
<ul>
<li>When to bundle up:</li>
</ul>
<blockquote>
<p>When the test for an object becomes too complicated to set up [...] consider
bundling up some of the collaborating objects.</p>
</blockquote>
<ul>
<li>
<p>Use interfaces to name roles played by objects. Keep interfaces narrow in
scope.</p>
</li>
<li>
<p>Goal is to move to "higher-order" programming: "composing programs from
smaller programs".</p>
</li>
<li>
<p>Don't use mocks for third-party code, since it is usually not changeable.
Use an adapter layer to implement interactions with third-party code.</p>
</li>
</ul>Links, the network analysis edition2016-08-08T00:00:00-05:002016-08-08T00:00:00-05:00Hana Leetag:hanalee.info,2016-08-08:/blog/links-the-network-analysis-edition.html<p>Useful links from the mastery cohort with Bobby Norton.</p><p>One of the many skill development opportunities available at 8th Light is
a "mastery cohort", where an acknowledged "master" with years of experience in
the software industry is invited to come give a day-long workshop for 8th Light
crafters. Last Friday, <a href="http://www.bobbynorton.com">Bobby Norton</a> came to give a
mastery cohort on data science and network analysis, and apprentices were
invited to participate. I was pretty enthusiastic to sign up because I've done a
lot of dabbling while studying genomics and systems biology; a lot of my
research interests as a scientist revolved around the emergent properties of
complex systems and understanding the behavior of gene regulation and metabolic
networks. I haven't studied graph theory at a rigorous level, other than reading
some papers by Barabási and doing a course project on "subgraph" patterns
(incidentally, also the project where I took the opportunity to first teach
myself Python!) way back in my second year of graduate school, but the subject
continues to fascinate me.</p>
<p>The Big Idea of the cohort was to apply the ideas from network analysis and
graph theory to software, which is in and of itself a complex system, and to see
if the insights that emerge could be used to identify the characteristics of
well-designed software and (conversely) diagnose problems in code more quickly.
It's one of those ideas that as soon as you hear it, you wonder why it hasn't
been done already! We looked at some tools for representing and visualizing
graphs and networks as well as Bobby Norton's own library written in Clojure for
analyzing dependencies and functions in Clojure codebases.</p>
<p><a href="http://visjs.org/">vis.js</a>: Javascript-based visualization library that can be
used to draw graphs and network diagrams</p>
<p><a href="https://gephi.org/features/">Gephi</a>: A visualization GUI tool that is fairly
popular and can be used for exploring network data.</p>
<p><a href="https://www.yworks.com/products/yed">yEd</a>: Similar to Gephi, with a slicker
interface. Main disadvantage is that there is no support for weighted edges.</p>
<p><a href="http://graphml.graphdrawing.org/index.html">GraphML</a>: XML-based standard for
representing graph data. File format should be supported by most GUIs.</p>
<p><a href="https://www.polinode.com/">Polinode</a>: Cloud-based platform for network
analysis.</p>
<p><a href="http://igraph.org/">iGraph</a>: Package for network analysis. Available for
<a href="http://igraph.org/r/">R</a> and <a href="http://igraph.org/python/">Python</a>.</p>
<p><a href="https://github.com/testedminds/edgewise">edgewise</a>: Clojure library for network
analysis.</p>
<p><a href="https://github.com/bobbyno/lein-topology">lein-topology</a>: Generates graph data
for a given Clojure library, can be used in conjunction with <code>edgewise</code> above to
analyze the software network structure.</p>
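<p>To illustrate the kind of question such tools answer, here is a toy sketch of degree analysis on a module-dependency graph (the edge list and module names are invented; real data would come from a tool like <code>lein-topology</code>):</p>

```python
from collections import Counter

# A toy dependency graph: each edge is (caller, callee).
edges = [("app.core", "app.db"), ("app.core", "app.http"),
         ("app.http", "app.db"), ("app.cli", "app.core")]

# In-degree: how many modules depend on each one. Highly
# depended-upon nodes are the "hubs" that network analysis
# would flag as critical (and risky to change).
in_degree = Counter(callee for _, callee in edges)
out_degree = Counter(caller for caller, _ in edges)

print(in_degree.most_common(1))  # [('app.db', 2)]
```
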
<p>Also, I don't know any Clojure yet, but since so much of the day was spent
with Clojure-based tools, I did collect a few links for learning Clojure:</p>
<p><a href="http://www.4clojure.com/problems">4Clojure</a>: A koan/kata-like site for learning
Clojure through exercises.</p>
<p><a href="https://github.com/JonyEpsilon/gorilla-repl">Gorilla</a>: A notebook-type REPL for
Clojure, kind of like Jupyter and apparently better than the actual Clojure
kernel for Jupyter (<a href="https://github.com/roryk/clojupyter">CloJupyter</a>), which
has a few bugs.</p>
<p>Some ideas I had for studies that could be done investigating the network analysis
of software:</p>
<ul>
<li>
<p>Get network representations of many software libraries in a given language,
and have some independent metric(s) for either performance or design quality.
Show statistically significant correlations between network properties and
such metrics.</p>
</li>
<li>
<p>Time series showing how the network architecture of software evolves while
moving from a "small-scale" to "large-scale" project.</p>
</li>
<li>
<p>Do different types of languages lend themselves to different types of network
topologies? (E.g. functional vs object-oriented vs procedural.)</p>
</li>
<li>
<p>Do different "subgraphs" correlate with well-known "design patterns" outlined
in best practices for software design?</p>
</li>
</ul>
<p>I wonder if any of these have already been done in the theoretical computer
science field. Worth searching on arXiv, one of these days.</p>Useful reading2016-07-31T00:00:00-05:002016-07-31T00:00:00-05:00Hana Leetag:hanalee.info,2016-07-31:/blog/useful-reading.html<p>A roundup of interesting links.</p><p>Over the past two weeks, I've come across several useful articles, blog posts,
and cheatsheets worth revisiting, either because they provide good reference
information or talk about concepts that will take some time for me to fully
digest. So I've decided to do a (semiregular?) roundup of links so I can spare my
browser the neverending tab spawn.</p>
<p><a href="https://8thlight.com/blog/uncle-bob/2013/05/27/TheTransformationPriorityPremise.html">The Transformation Priority
Premise</a>:
My mentor sent me this blog post to read last week. It is an attempt to
systematize, in a way, how one chooses to implement code that makes a unit test
pass. What I find particularly intriguing is the idea that there could be
generalizable principles that underlie the TDD process. TDD itself to me seems
like a generalizable approach to software development, as Uncle Bob himself
suggests towards the end of this post:</p>
<blockquote>
<p>The sequence of tests, transformations, and refactorings may just be a formal
proof of correctness.</p>
</blockquote>
<p>(Beck also alludes to this possibility at the end of <em>Test-Driven Development by
Example</em> when he mentions the behavior of complex systems. On my own end, I can
see several parallels to the process of biological evolution; maybe that can be
a topic for a future blog post.) Anyway, if you can systematize the
"transformations" during TDD, maybe you can also build software that can write
software. (I guess that is already being done, to a certain extent, with machine
learning, but I don't think those approaches incorporate TDD. I should read up on this
subject more; it's so satisfyingly recursive.)</p>
<p><a href="http://ieftimov.com/tdd-humble-object">TDD Patterns: Humble Object</a>: I think I
stumbled across this link while reading Stack Overflow. I suspect it will become
especially useful when I start trying to build the actual I/O part of my current
project (building a command-line tic-tac-toe in Java). I also appreciate the API
example because I've been wondering how you do TDD on web development projects,
and this post illustrates that pretty well. </p>
<p><a href="https://github.com/Droogans/unmaintainable-code">How to Write Unmaintainable
Code</a>: Really hilarious but
also describes a number of bad habits that I would like to remember to avoid.</p>
<p><a href="http://jgthms.com/web-design-in-4-minutes/">Web Design in 4 Minutes</a>: Not
directly related to anything I'm doing at work right now but I have two side
projects involving building web sites, and I'm pretty weak on the front-end part.</p>
<p><a href="http://randsinrepose.com/archives/how-i-slack/">How I Slack</a>:
All the Slack keyboard shortcuts one could ever need.</p>
<p><a href="http://vim.rtorr.com/">Vim Cheat Sheet</a>: I've been using <code>vim</code> for almost six
years now as my text editor of choice, but I definitely do <em>not</em> use it to full
capacity.</p>
<p><a href="https://tmuxcheatsheet.com/">Tmux Cheat Sheet</a>: Recently switched from using
<code>screen</code> to <code>tmux</code>, so I'm still in the process of trying to memorize the
commands.</p>Prime Factors coding kata2016-07-26T00:00:00-05:002016-07-26T00:00:00-05:00Hana Leetag:hanalee.info,2016-07-26:/blog/prime-factors-coding-kata.html<p>Thoughts on practicing the Prime Factors coding kata</p><p>The concept of a coding kata originates with <a href="http://codekata.com/kata/codekata-how-it-started/">Dave
Thomas</a> who based it on his
practice of karate. When I started looking around for potential katas, I found a
lot of blog posts that argued over what the purpose of kata was and whether they
were useful. In particular, there seems to be conflict over whether you are
supposed to perform kata the same way every time or use them as an opportunity to
explore the solution space to an interesting problem.</p>
<p>My prior experience with kata up until now has been via a different Japanese
martial art, kendo. I imagine kata function there more or less as they
do in karate: a series of choreographed movements that are supposed to
demonstrate important principles and ideal form. What becomes clear as a
beginner is that you may be doing the same sequence of moves each time but you
never really do manage to move in the exact same way. One time, you may hold
your bokuto (the wooden blade that you use for kata) a centimeter higher; the
next time, the third step you take might land you slightly closer or further
from your opponent. (Kendo kata are always done in pairs, simulating an exchange
of strikes between two swordsmen.) When performed mindfully, kata should help
the practitioner achieve new insight into their art.</p>
<p>The other important aspect of kata is that there is usually a Platonic ideal
form that one can approach asymptotically but probably never
actually achieve. (Unless you are a short old hachidan with a pot belly who
nevertheless moves like lightning.)</p>
<p>Anyway, practicing coding kata has some similarities to, and some differences
from, its martial arts counterpart. I
haven't repeated my chosen kata all that many times yet, but I definitely do it
slightly differently each time I tackle it from memory. On the other hand, I
don't think one can claim there is one perfect sequence of keystrokes for a
given coding problem.</p>
<p>I've been practicing the Prime Factors kata, which was
originally characterized by <a href="http://butunclebob.com/ArticleS.UncleBob.ThePrimeFactorsKata">Uncle
Bob</a>. It's a
fairly simple algorithm used to generate prime factors of any natural number.
The first time, I followed Uncle Bob's walkthrough step for step; since then,
I've been going through the kata, TDD-style, on my own. What I've been wondering
how to handle is the fact that I remember the end solution and start to skip
steps in the process of getting there.</p>
<p>For example, after writing the test to factor 2, the walkthrough implements the
following to make it pass:</p>
<div class="highlight"><pre><span></span><span class="k">if</span> <span class="o">(</span><span class="n">number</span> <span class="o">></span> <span class="mi">1</span><span class="o">)</span> <span class="o">{</span>
<span class="n">primes</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="mi">2</span><span class="o">);</span>
<span class="o">}</span>
</pre></div>
<p>But what I habitually come up with is:</p>
<div class="highlight"><pre><span></span><span class="k">if</span> <span class="o">(</span><span class="n">number</span> <span class="o">></span> <span class="mi">1</span><span class="o">)</span> <span class="o">{</span>
<span class="n">primes</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">number</span><span class="o">);</span>
<span class="o">}</span>
</pre></div>
<p>That also means I never write a test for factoring 3, because by then it seems
like a trivial case.</p>
<p>Another example: after writing the test to factor 4, the walkthrough uses the
following code to make it pass:</p>
<div class="highlight"><pre><span></span><span class="k">if</span> <span class="o">(</span><span class="n">number</span> <span class="o">></span> <span class="mi">1</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">number</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
<span class="n">primes</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="mi">2</span><span class="o">);</span>
<span class="n">number</span> <span class="o">/=</span> <span class="mi">2</span><span class="o">;</span>
<span class="o">}</span>
<span class="k">if</span> <span class="o">(</span><span class="n">number</span> <span class="o">></span> <span class="mi">1</span><span class="o">)</span>
<span class="n">primes</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">number</span><span class="o">);</span>
<span class="o">}</span>
</pre></div>
<p>Whereas I again end up jumping ahead of myself to:</p>
<div class="highlight"><pre><span></span><span class="kt">int</span> <span class="n">factor</span> <span class="o">=</span> <span class="mi">2</span><span class="o">;</span>
<span class="k">while</span> <span class="o">(</span><span class="n">number</span> <span class="o">></span> <span class="mi">1</span><span class="o">)</span> <span class="o">{</span>
<span class="n">primes</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">factor</span><span class="o">);</span>
<span class="n">number</span> <span class="o">/=</span> <span class="n">factor</span><span class="o">;</span>
<span class="o">}</span>
</pre></div>
<p>Now that I lay them out side by side like that, it seems apparent that I am
always choosing to go one or two levels of abstraction beyond what the unit test
immediately requires. Hmmm.</p>
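<p>For reference, the version the kata converges on (at least in my hands) looks something like this; this is my own reconstruction of the end state, not Uncle Bob's code verbatim:</p>

```java
import java.util.ArrayList;
import java.util.List;

public class PrimeFactors {
    // Generate the prime factors of n by repeated trial division.
    public static List<Integer> generate(int n) {
        List<Integer> primes = new ArrayList<>();
        for (int factor = 2; n > 1; factor++) {
            // Divide each factor out completely before moving on,
            // so only primes ever get added to the list.
            while (n % factor == 0) {
                primes.add(factor);
                n /= factor;
            }
        }
        return primes;
    }
}
```

<p>Dividing each factor out completely before incrementing is what guarantees that only primes end up in the list, without any explicit primality test.</p>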
<p>It also seems that one can use coding kata to explore multiple routes to the
same destination: i.e. that in some senses, one can and should use the
opportunity to do the kata differently each time. Some kata seem more suited
for this approach than others. So far, the only thing that I have tried is to
test the performance of the algorithm on large primes. (It will work for any
number up to the maximum value of Java's <code>int</code> type.) Some ideas that I would
like to play around with once I am reliably able to reproduce the steps in the
walkthrough from memory:</p>
<ul>
<li>Handling negative numbers (should the code return an error or the prime
factors of the absolute value?)</li>
<li>Handling numbers that have to be stored in <code>long</code> because they are too
big for <code>int</code></li>
<li>Deriving one of the well-known factorization algorithms (e.g. <a href="https://en.wikipedia.org/wiki/Fermat%27s_factorization_method">Fermat's
method</a> or
<a href="https://en.wikipedia.org/wiki/Euler%27s_factorization_method">Euler's
method</a>) via
TDD (is that possible?!)</li>
</ul>Test-Driven Development by Example2016-07-23T00:00:00-05:002016-07-23T00:00:00-05:00Hana Leetag:hanalee.info,2016-07-23:/blog/test-driven-development-by-example.html<p>Notes on <em>Test-Driven Development by Example</em> by Kent Beck</p><p>I finished working through the two examples in this book as well as reading
through the last part on common "patterns" or principles to help generalize the
process of test-driven development. For the first part, building a multicurrency
calculator, I directly followed the book. The second part, which builds the core
automated testing functions of xUnit, was demonstrated in Python, so I used the
opportunity to practice and followed along in Java. My code is up at
<a href="http://github.com/hnlee/tddbyexample">Github</a> and undergoing review by my
mentor and co-mentor.</p>
<p>I jotted down a lot of notes while reading through the book, particularly the
last section, which I suspect I will end up revisiting as I gain more experience
with test-driven development. I'll try to present them here in a semi-organized
fashion.</p>
<h3>Tips for writing tests</h3>
<ul>
<li>I found the following quote worth keeping in mind: </li>
</ul>
<blockquote>
<p>You will likely end up with about the same number of lines of test code as
model code when implementing TDD (p. 78)</p>
</blockquote>
<p>As a note, this principle does not mean you end up writing more total lines of code,
since the TDD process also encourages you to keep your model code simple.</p>
<ul>
<li>
<p>Overall, the number of test classes should equal the number of model classes
(although you don't necessarily need an exact one-to-one correspondence).</p>
</li>
<li>
<p>Tests should be independent of one another. They should be able to run in any
order.</p>
</li>
<li>
<p>Differences in test data should represent meaningful, different use cases.</p>
</li>
<li>
<p>When writing a test, try writing the assert statements first.</p>
</li>
<li>
<p>Another quote worth keeping in mind:</p>
</li>
</ul>
<blockquote>
<p>You are writing tests for a reader, not just the computer.</p>
</blockquote>
<p>Tests can function as a sort of documentation by demonstrating the behavior of
the software under different use cases. Similarly, writing tests for packages
created by other people can be a good way of learning how to use them.</p>
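<p>As a sketch of the "write the asserts first" tip: the <code>Money</code> class below is a stripped-down, hypothetical stand-in (not Beck's actual multicurrency code), included only to make the example self-contained.</p>

```java
// A minimal, hypothetical Money class, just enough to make the example run.
class Money {
    private final int amount;
    Money(int amount) { this.amount = amount; }
    static Money dollar(int amount) { return new Money(amount); }
    Money times(int multiplier) { return new Money(amount * multiplier); }
    @Override public boolean equals(Object o) {
        return o instanceof Money && ((Money) o).amount == amount;
    }
    @Override public int hashCode() { return amount; }
}

public class MoneyTest {
    public static void main(String[] args) {
        // The assertion ($5 * 2 == $10) gets written first; it forces the
        // API questions -- what does times() return? how is equality
        // defined? -- before any setup code exists.
        Money five = Money.dollar(5);
        assert Money.dollar(10).equals(five.times(2)) : "5 * 2 should be 10";
        System.out.println("ok");
    }
}
```

<p>Working backwards from the assertion to the setup keeps the test focused on the behavior you actually want, rather than on whatever objects happen to be easy to construct.</p>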
<ul>
<li>
<p>However, in your own work, you usually do not test the parts where you use
code written by others.</p>
</li>
<li>
<p>When you don't know what test to write next:</p>
<ul>
<li>Write a test for functionality that seems doable but not obvious how to
implement.</li>
<li>Use a to-do list to guide your test-writing.</li>
</ul>
</li>
<li>
<p>Log strings can make testing easier.</p>
</li>
<li>
<p>Treat the objects like black boxes in your tests. This strategy ensures that
the objects stay modular and do not get too coupled. Tests should leave the
objects in the same state they were in prior to testing. </p>
</li>
</ul>
<h3>Tips for getting tests to pass</h3>
<ul>
<li>
<p>If a test case seems too big, write a smaller test focusing on the broken part
of the bigger one. In this way, you can isolate the bug by finding the smallest
test that fails.</p>
</li>
<li>
<p>Start with a "fake" implementation -- a trivial or non-meaningful solution --
to get the test to pass, then refactor with the right implementation.</p>
</li>
<li>
<p>If you need to handle collections of objects, implement it for a single object
first, then generalize to collections.</p>
</li>
<li>
<p>In general, think first about the code's behavior, <em>then</em> think about its
design.</p>
</li>
</ul>
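<p>The "fake it" progression is easiest to see with a toy like Fibonacci (Beck sketches a similar progression in one of the book's appendices, if I remember right): return a constant to pass the first test, fake again for the second, then generalize once the duplication becomes obvious.</p>

```java
public class Fibonacci {
    // fib(0) == 0: the first implementation was literally "return 0;".
    // fib(1) == 1: faked again with "return n;" -- still trivial.
    // fib(2) == 1, fib(3) == 2: the fakes no longer survive, so the
    // constants give way to the general recursive rule below.
    public static int fib(int n) {
        if (n < 2) return n;
        return fib(n - 1) + fib(n - 2);
    }
}
```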
<h3>Tips for refactoring</h3>
<ul>
<li>
<p>Beck frequently uses the concept of "triangulation", where you only implement
abstraction after you have two or more examples with duplicated behavior. On a
similar note, only unify code when it has become identical in both examples. </p>
</li>
<li>
<p>There are certain design patterns worth keeping in mind for refactoring: null
objects, template methods, pluggable objects and selectors, factory methods,
impostor objects, composite objects, and collector parameters.</p>
</li>
<li>
<p>Refactor long methods by taking out one part and putting it in a smaller
separate method.</p>
</li>
<li>
<p>Temporarily duplicating code or data is a good way to ensure robustness while
moving things around. It means your tests will keep passing during refactoring.</p>
</li>
<li>
<p>The tests themselves can indicate design problems that need to be addressed.
E.g. if the tests require a lot of setup code, if the tests take a long time to
run, and if tests that were passing start breaking unexpectedly. </p>
</li>
</ul>
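<p>That last move is essentially the Extract Method refactoring; a minimal, hypothetical sketch:</p>

```java
// Before: one method doing two jobs (calculating and formatting).
class ReceiptPrinter {
    String print(int quantity, int price) {
        int total = quantity * price;
        return "Total: " + total;
    }
}

// After: the calculation extracted into its own named method.
class RefactoredReceiptPrinter {
    String print(int quantity, int price) {
        return "Total: " + total(quantity, price);
    }

    private int total(int quantity, int price) {
        return quantity * price;
    }
}
```

<p>The behavior is unchanged, which is exactly what the passing tests should confirm at every step of the refactoring.</p>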
<h3>Miscellany</h3>
<ul>
<li>
<p>The Expression object that is created to handle sums of mixed currencies in
the first part can be thought of as analogous to a mathematical vector, in which
each dimension represents a currency. Or another way of putting it would be
multivariate linear systems of equations.</p>
</li>
<li>
<p>Beck describes the number of changes per refactoring as a bell curve with a
fat tail. The pedant in me has to note that the graph he shows is a Poisson
distribution (commonly used to model count data, which "number of changes" would
fall under), not Gaussian. It is, however, leptokurtic, as he describes. (Word of
the day: the antonym of leptokurtic is platykurtic.)</p>
</li>
<li>
<p>The book mentions the existence of tools used to evaluate test coverage (e.g. JProbe),
which is an interestingly meta concept. Software that analyzes software!</p>
</li>
<li>
<p>Another fun math metaphor at the end describes software created through TDD as
converging on steady state attractors, where your final destination is
error-free code.</p>
</li>
</ul>First day at 8th Light2016-07-18T00:00:00-05:002016-07-18T00:00:00-05:00Hana Leetag:hanalee.info,2016-07-18:/blog/first-day-at-8th-light.html<p>My first day as a resident apprentice at 8th Light.</p><p>It's been a while since I last updated this blog. Last time I wrote, I was
starting to apply for jobs outside academia. I was primarily focusing
on data scientist positions, due to my experience with statistical
analysis, but I was also interested in the possibility of a more generalist
software engineering role. The latter would require an expanded skill set
and knowledge of software development, since as a researcher, I
usually only wrote code for a single user: myself.</p>
<p>Enter 8th Light, a <a href="http://8thlight.com">software consulting company</a> that has pioneered the
apprenticeship model for training software engineers. While attending
tech Meetups in Chicago, I'd heard about 8th Light and its reputation for
excellent coding practices and mentorship. So one day, I submitted an
application, figuring it couldn't hurt.</p>
<p>To my surprise, I was invited to do a code submission, involving
refactoring and extending the code for a tic-tac-toe game. I can't emphasize
enough how much I was able to learn from this experience alone; I think it
was the first time I really got to experience a code review with detailed
feedback on my code. Then I visited 8th Light's Chicago office for an
interview, and to make a long story short, I signed a contract to start as
a resident apprentice in July, with <a href="http://paytonrules.com">Eric Smith</a> as
my mentor.</p>
<p>Today was my first day, and I spent the first part of it dealing with
administrivia, like handing in paperwork, setting up my laptop, getting my
company email and Slack accounts, etc. I got to briefly meet some of the
other apprentices during a weekly talk; the topic this time was on leadership,
particularly on different <a href="http://www.fastcompany.com/1838481/6-leadership-styles-and-when-you-should-use-them">leadership
styles</a>,
when to use them, and how to improve one's leadership skills. (I think that
alone illustrates a lot about 8th Light's company culture: the emphasis on
self-cultivation, the attention paid to organizational and interpersonal
dynamics, and the conscientious approach to the work being done.) Then I met up
with Eric, who wrote the email introducing me to the company and filled
me in on how the apprenticeship would proceed on a day-to-day basis.</p>
<p>I spent the afternoon working through <a href="https://www.amazon.com/Test-Driven-Development-Kent-Beck/dp/0321146530">Test Driven Development by
Example</a>.
The book demonstrates the process of test driven development by
walking you through step by step two example cases. The first builds a
multicurrency calculator, and the second creates a framework for
automated testing. I'd read through the book before and tried to follow the
first example by translating the Java code into Python. But now that I had spent some time
absorbing the basics of Java (as preparation for the apprenticeship),
Eric suggested that I go through the book with code written in Java. That became
my first topic of study.</p>
<p>First I spent some time setting up JUnit, the Java package for unit testing,
and figuring out how to use it. Then I opened the book to chapter 1 and began
putting together the multicurrency calculator. It definitely helps to
be reading this book again with more knowledge of Java; the logic behind the
steps taken is much clearer now that I have a better understanding of how
Java classes are structured and how they work. I worked to the end of chapter
12, which begins implementing an addition method.</p>
<p>What troubles me is that while I think I mostly comprehend the sequence of
thoughts behind each step, I don't yet have a grasp of the more general
principles that should guide decisions during the development process. E.g. there
are some instances when one has to "go backwards" (revert to previous code) in
order to have a test pass. While I understood why that was necessary at that
particular stage of developing the multicurrency calculator, I don't
know how I would recognize another situation that would call for the same
tactic. Kent Beck, the author, says the intuition will develop with
practice, so hopefully it will become more apparent with time.</p>
<p>Two points I have taken away from this afternoon's work:</p>
<ul>
<li>Don't be afraid to write ugly code. Write something that will make the test
pass first, <em>then</em> refactor it to be good code.</li>
<li>There's not a set "size" to the steps taken; you can modulate according
to the situation. But if you're running into problems, the immediate response
should be to make the steps smaller. A possible rule of thumb may be that
the steps should be small enough to make you feel slightly impatient.</li>
</ul>ChiPy Mentorship: Finishing Up2016-01-14T00:00:00-06:002016-01-14T00:00:00-06:00Hana Leetag:hanalee.info,2016-01-14:/blog/chipy-mentorship-finishing-up.html<p>Finishing up the Fall 2015 ChiPy Mentorship Program</p><p>Since my last <a href="http://hanalee.info/blog/chipy-mentorship-progress.html">update</a>, I've made considerable improvements to my classifier for the SF crimes Kaggle competition. Delving into the other variables and engineering new features, such as time of day, helped improve the performance of my logistic regression model considerably. I even got to experiment a bit with <em>k</em>-means clustering, which is an unsupervised learning method, in order to define "neighborhoods" from the longitude and latitude data. I jumped up about 306 positions and had a log-loss score of 2.54026 when used to make predictions on the external test data set provided by Kaggle.</p>
<p>I also experimented with combining multiple models and averaging their predictions to improve performance. Random forests on the same feature set gives a cross-validation score of 2.74832, but when combined with the better performing logistic regression model, it helps improve the overall performance of my classifier and gives a log-loss score of 2.49766 on the Kaggle test data.</p>
<p>As the final part of the mentorship program, the mentees will be presenting five-minute "lightning talks" at the ChiPy monthly meeting tonight. Preparing this presentation has been an extensive learning experience in and of itself: it's my first time using the Jupyter notebook system to make <code>reveal.js</code> slides. I especially like the "subslide" feature that scrolls vertically instead of horizontally, which gives additional structure to the presentation. I've hosted the slides for the talk on Github: <a href="http://hnlee.github.io/sfcrimes/#/">Predicting type of urban crime: Python, Kaggle, and SF OpenData</a>.</p>
<p>Looking back on the goals I've set, I realize that while I haven't quite fulfilled all of them, I've managed to accomplish a fair amount.</p>
<ol>
<li>I've made multiple submissions to the Kaggle competition, and my highest-scoring submission places in the top 25%.</li>
<li>I've learned so much more about machine learning by reading books and watching videos that my mentors recommended to me.</li>
<li>I haven't submitted any job applications yet...but I have a <a href="http://hanalee.info/static/pdfs/resume.pdf">resume</a> instead of the old academic-style CV and am writing cover letters for several job listings.</li>
</ol>
<p>I plan on tinkering a little bit more with the SF crimes data over the next few days, as I have hopes of further improving my model. I'm currently running gradient boosting on the training data, and I would like to do some more parameter tuning.</p>
<p>The mentorship program was a rewarding experience outside of simply working on my project and meeting with my mentors. Going to the coding dojos exposed me to paired/group programming, and the ChiPy mailing list has also pointed me to a lot of resources on software development, not all specific to Python either. It's made me realize that I really enjoy the problem solving and abstract thinking involved in programming, which gives me added confidence that making this career shift is the right decision. </p>Notes on ChiPy December 2015 talks2015-12-17T00:00:00-06:002015-12-17T00:00:00-06:00Hana Leetag:hanalee.info,2015-12-17:/blog/notes-on-chipy-december-2015-talks.html<p>My notes on the talks given at the ChiPy meeting on 10 Dec 2015 at the National Association of Realtors.</p><h3>Portable Format for Analytics</h3>
<p>Speaker: Robert Grossman and Collin Bennet</p>
<ul>
<li>Motivation: often hard to deploy analytics because making models production-ready/scalable requires standardization between different programming languages</li>
<li>Business analytics requires coordinating three different groups of people:<ul>
<li>Data engineers (infrastructure)</li>
<li>Data scientists (model producers)</li>
<li>"Analytic ops" (model consumers)</li>
</ul>
</li>
<li>Previous attempt at standardization: Predictive Model Markup Language (PMML)</li>
<li>PMML describes models in XML but does not include information about how multiple models can be chained together or used as pre-/post-processing</li>
<li>New emerging standard: Portable Format for Analytics</li>
<li>PFA uses JSON and is more flexible than PMML without breaking data pipelines</li>
<li>Titus is the Python engine used to produce and score PFA models</li>
<li>Engines available in other languages, e.g. Aurelius for R (naming system based on Roman emperors)</li>
<li>See <a href="https://github.com/opendatagroup">Open Data Group's Github</a> for more information</li>
<li>Tutorials on PFA available at <a href="http://dmg.org/pfa/">Data Mining Group</a></li>
<li>Normal use case is to develop model in language of choice, use Titus to pull out relevant parts of the model and export it to JSON, and then rewrite it in a faster form for production</li>
<li>Also possible to develop models directly in Titus</li>
<li>Compliance tests to ensure that implementation stays consistent between different languages</li>
</ul>
<h3>SQLAlchemy</h3>
<p>Speaker: Will Engler</p>
<ul>
<li>SQLAlchemy often called "database toolkit" for Python</li>
<li>Beyond just object-relational mapping (ORM) = associating object-oriented classes with database tables (like models in Django)</li>
<li>"Core" layers beneath the ORM features</li>
<li>Engine that connects to many different "dialects" of SQL</li>
<li>MetaData object acts as a registry for database tables </li>
<li>Table object allows representation of tables from databases without creating a new object</li>
<li>SQL expressions as Python objects, enabling ease of chaining for complex queries and switching between SQL dialects</li>
<li>Security features to protect against SQL injection</li>
</ul>
<h3>Meet the micro:bit</h3>
<p>Speaker: Naomi Ceder</p>
<ul>
<li>Slides available at <a href="http://goo.gl/lu7tXc">http://goo.gl/lu7tXc</a></li>
<li>BBC Micro: computer sponsored to aid computer literacy in the UK during the 1980s, very popular in schools</li>
<li>BBC micro:bit: development influenced by the current new push for coding literacy in the UK (including a national curriculum for computing)</li>
<li>Plan was to produce computer very cheaply and distribute to every Year 7 (~11 year old) student in the UK</li>
<li>Behind schedule due to production problems, but currently five models available for testing</li>
<li>Partnered with the Python Software Foundation, goal to get Python on it</li>
<li>"Official" coding options involved Microsoft's touch development environment and block editor</li>
<li>In looking for an alternative, Micropython: originally developed for Raspberry Pi</li>
<li>"Micro:bit world tour": send out models and track their progress on a <a href="http://microworldtour.github.io/">map</a></li>
<li>Cool stuff:<ul>
<li>Use onboard accelerometer to fly an X-wing in Minecraft</li>
<li>Flash messages on LED array</li>
</ul>
</li>
</ul>
<p><strong>Addendum:</strong> I never wrote up the notes for the November meeting, but the <a href="http://tanyaschlusser.github.io/Python-Fu-in-GIMP.slides.html#/">slides</a> and <a href="http://bit.ly/chipy_gimp">code</a> are available for Tanya Schlusser's talk, "Python-fu in the GIMP". Worth reviewing!</p>ChiPy Mentorship: Progress2015-12-07T00:00:00-06:002015-12-07T00:00:00-06:00Hana Leetag:hanalee.info,2015-12-07:/blog/chipy-mentorship-progress.html<p>What I've accomplished so far.</p><p>To recap, my main objective for the ChiPy mentorship program is to work on the <a href="https://www.kaggle.com/c/sf-crime">San Francisco crime classification</a> Kaggle competition. I've made the Jupyter/iPython notebook I am using for analysis available <a href="https://github.com/hnlee/sfcrimes/blob/master/sfcrimes.ipynb">on Github</a>.</p>
<p>This particular data set is very simple, consisting only of eight features aside from the target, which corresponds to the category of crime. I began with some exploratory data analysis, looking at the frequency of crimes across different categories, police districts, and days of the week. I also took a look at the distribution of crimes across longitude and latitude, which revealed some extreme outliers, likely to be the result of mistakes in data entry.</p>
<p>Kaggle evaluates submissions based on the <a href="https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss">log-loss metric</a> and requires you to predict probabilities for each class of crime category on every row in the test set. Thus, the simplest possible model is just the average frequency of each crime category. I split the training data into internal training and test sets for quick cross-validation and fit the model by calculating the means on the internal training set. The performance of this model is actually not too shabby, with a log-loss score of 5.46035 on the internal test set.</p>
<p>The next simplest model was to use a single categorical variable, that of police district, as a predictor. After transforming this column into dummy variables, I fitted a logistic regression model to the internal training set. This model performed even better, with a log-loss score of 2.61714 on the internal test set. I then used this model to generate predictions on the test data from Kaggle and made a submission, just as a practice run. My initial position on the leaderboards at the time of submission was at a pretty abysmal 476th place. (I've slipped even further in the standings since then!) But my score on Kaggle's test set was 2.61626, which is not very far off from the current top score on the leaderboard at 2.06702. That implies that I'm already very close to the limits of the signal in the dataset...assuming of course that there isn't some undiscovered breakthrough that hasn't occurred to anyone in the Kaggle community yet.</p>
<p>Still, I suspected that more sophisticated machine learning methods could perform better, however incrementally, than my univariate logistic regression. As a next step, I incorporated the day of the week as another categorical variable into the logistic regression model. However, this feature did not appear to contribute very much, as my log-loss score from the internal test set was only 2.61416.</p>
<p>I turned to random forests, a popular ensemble learning method that is based on decision trees. I also set up five-fold cross-validation (more robust than a single split into training and test). Using police districts and the day of the week again, random forests did not seem to perform any better than logistic regression, with a mean log-loss score of 2.61821 across the five folds.</p>
<p>I spent last Sunday discussing what to do next with my mentor, and I plan to work on the following over the next couple of weeks:</p>
<ol>
<li><strong>"Hacking" the score:</strong> Since performance is measured via log-loss, there's an extreme penalty for predicting any crime category probability as zero. Adding a minimal constant and renormalizing before submitting the probabilities matrix should help avoid this situation.</li>
<li><strong>Feature engineering:</strong> Try to harness what signal is remaining in the other variables, such as extracting the month or time of day from the date column. (Time of day could be even be binned into categories like morning, afternoon, and night.) Given the usefulness of the police district column, the higher-resolution geographic data in the latitude and longitude columns may prove to perform even better, although it would require some initial cleaning and may need to be processed via clustering first.</li>
<li><strong>Gradient boosting:</strong> Random forests is basically a <a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating">bagging</a> method. Another major category of ensemble learning algorithms is <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)">boosting</a>, which in the form of <code>XGBoost</code> (<a href="https://github.com/dmlc/xgboost">eXtreme Gradient Boosting</a>) has been one of the most successful methods on Kaggle. Since it is fundamentally a different approach to combining multiple machine learning models, it should be worth trying.</li>
<li><strong>"Meta-learning":</strong> Going one step beyond ensemble learning by running multiple methods and averaging their predictions. (The predictions can also be combined in more sophisticated ways, e.g. weighting each method by its performance.)</li>
</ol>
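<p>The first item on this list takes only a few lines of <code>numpy</code> to sketch; the choice of <code>eps</code> below is an arbitrary placeholder:</p>

```python
import numpy as np

def clip_and_renormalize(probs, eps=1e-15):
    """Replace zero (or tiny) probabilities with a small floor,
    then rescale each row so it sums to 1 again."""
    clipped = np.clip(probs, eps, 1.0)
    return clipped / clipped.sum(axis=1, keepdims=True)

# A fully confident row would incur an unbounded log-loss penalty
# if the true class were one of the zero entries
preds = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
safe = clip_and_renormalize(preds)
```

With <code>eps=1e-15</code>, the worst-case penalty for a single prediction drops from infinite to roughly <code>-log(1e-15) ≈ 34.5</code>, which is the whole point of the trick.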
<p>As a side note, going from R to Python with <code>numpy</code> and <code>pandas</code> has made me think that there really ought to be a website that allows you to search an R function and find the equivalent in Python. I've spent a lot of time muttering to myself things like, "How do I <code>cbind()</code>? How do I <code>colSums()</code>? What's <code>table()</code> in <code>pandas</code>?" (If there is such a website, let me know!)</p>Notes on "Lessons from 2MM machine learning models"2015-11-03T00:00:00-06:002015-11-03T00:00:00-06:00Hana Leetag:hanalee.info,2015-11-03:/blog/notes-on-lessons-from-2mm-machine-learning-models.html<p>My notes on a talk given by Kaggle's founder, Anthony Goldbloom, on 2 Nov 2015 at the Blue 1647 Innovation Center.</p><p>Not a comprehensive outline of the talk, just a list of points that I found interesting.</p>
<ul>
<li>During the timeline of a competition, scores reach a plateau or floor where subsequent increases in accuracy are minimal<ul>
<li>The "four-minute mile" phenomenon: when someone makes a breakthrough that dramatically pushes past a plateau, it is immediately replicated by others</li>
<li>Otherwise the floor represents the limits of the signal in the dataset</li>
<li>Most competitions do reach this floor, unless the dataset is too noisy or simply doesn't contain enough signal</li>
</ul>
</li>
<li>Neural networks are dominating in any competition involving images, speech, or text</li>
<li>Two approaches to winning<ul>
<li>Creative feature engineering: make plots, test many different combinations of features, use version control to keep track<ul>
<li>E.g. used car competition, where winning model depended on the crucial feature of unusual car colors vs. standard car colors</li>
</ul>
</li>
<li>Parameter tuning: usually only gets incremental improvements in score</li>
</ul>
</li>
<li><a href="https://github.com/dmlc/xgboost">XGBoost</a> (variant on gradient boosting) also dominating in competitions</li>
<li>To guard against overfitting, final scoring of submissions uses completely new test data<ul>
<li>Overfitting is the most common issue in supervised learning problems</li>
<li>Phenomenon where someone high up on leaderboard drops a hundred places after final scoring</li>
<li>Can guard against overfitting by ignoring feedback from parameter tuning unless score improves above standard error</li>
</ul>
</li>
<li>How are test sets generated?<ul>
<li>Out-of-time sampling</li>
<li>Out-of-sample sampling</li>
<li>Stratified sampling (if one of the classes being predicted is very rare in the dataset)</li>
</ul>
</li>
<li>Boundaries between different types of problems: which ones are suited for the neural network approach vs. the XGBoost/random forests/etc. approach?<ul>
<li>Unstructured data for former, very structured data for latter</li>
<li>What about in-between cases? e.g. EEG data for grasping vs lifting: time series data where neural networks won</li>
</ul>
</li>
<li>Any way to automate feature engineering? - a hard problem...</li>
<li>Optimizing behavior in response to machine learning results? - also a hard problem...</li>
<li><a href="https://www.kaggle.com/scripts">Kaggle Scripts</a> as a learning resource, "Github for data science"</li>
<li>Properties of Kaggle winners: good coders, careful use of version control, coding best practices, tenacity </li>
</ul>ChiPy Mentorship: Learning Goals2015-11-02T00:00:00-06:002015-11-02T00:00:00-06:00Hana Leetag:hanalee.info,2015-11-02:/blog/chipy-mentorship-learning-goals.html<p>Objectives for the Fall 2015 ChiPy Mentorship Program</p><p>Last month, I learned that the official Chicago Python user group, ChiPy, organizes a mentorship program for members looking to improve their Python skills. I applied to the Data Science track and was assigned to Eric Meschke, who is a software engineer working in the finance industry. I met up with Eric a couple of weeks ago, as well as his mentee from last year, Alex Flyax. It turned out that Alex, like me, had finished his postdoc and decided to transition to a career in data science. By the end of the mentorship program, he had successfully obtained a job at a startup in Chicago, where he is currently working. (So basically, I hope to follow in his footsteps.)</p>
<p>I'm really excited to have the benefit of both Eric and Alex's perspectives. At our first meeting, they shared very good advice on what it's like to work in the data science field as well as tips on preparing job applications. Afterwards, Alex emailed me a long list of resources, particularly textbooks and videos that would be helpful in shoring up my theoretical knowledge. </p>
<p>When I applied to the program, I stated that my goal was to complete a Kaggle competition to use as a portfolio project during the job hunt. That turned out to be a good idea, since it was exactly what Eric and Alex did last year! Eric advised picking a competition that has clear metrics for evaluating submissions, and Alex recommended sticking to general machine learning methods (rather than more specialized methods for time series data or natural language processing) for now. I looked over the open Kaggle competitions and decided to tackle the one on <a href="https://www.kaggle.com/c/sf-crime">San Francisco crime classification</a>. As a former resident of the Bay Area, it's a topic that I find personally interesting. I may end up working on this project together with Alex's mentee, Zaynaib (you can go read <a href="https://zenagiwa.wordpress.com/tag/chipy/">her blog posts on ChiPy</a> as well).</p>
<p>I've worked through the Kaggle tutorial on <a href="https://www.kaggle.com/c/titanic">Titanic survival data</a> at <a href="http://dataquest.io">Dataquest</a>. (I've put up my IPython/Jupyter notebook going through the steps on Github: <a href="https://github.com/hnlee/titanic">titanic</a>.) I've also been reviewing probability and statistical theory and learning more about machine learning methods by watching course lectures.</p>
<p>By the end of this mentorship, I hope to have accomplished the following:</p>
<ol>
<li>Make a submission to the SF crimes Kaggle competition that scores in the top 10%, using only Python.</li>
<li>Study statistical theory, machine learning, and algorithms.</li>
<li>Do at least one practice job interview with my mentors.</li>
<li>Apply to data science jobs (...and hopefully get interviews and an offer letter).</li>
</ol>Goalsetting2015-11-01T00:00:00-05:002015-11-01T00:00:00-05:00Hana Leetag:hanalee.info,2015-11-01:/blog/goalsetting.html<p>Where I see myself in the future, professionally</p><p>For the past couple of years, I've been tossing around the idea of creating a blog for my professional life. Now that I am beginning a career transition, from the world of academic research to data science, it seems like an opportune moment to make one.</p>
<p>As a starting point, I wanted to write down my answer to that ubiquitous job interview question: "Where do you see yourself in <em>x</em> years?"</p>
<p>In a year, I see myself as employed full-time in a data scientist position, working with Python and R to construct models for making predictions from interesting data<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> and to create user-friendly interfaces to visualize and communicate this data to others. I will be contributing to at least one open source project and actively continuing to improve my programming skills and knowledge of statistical methods.</p>
<p>In five years, I see myself moving up the career ladder wherever I am employed. I will be engaged with local developer communities and other professionally relevant organizations (e.g. participating in hackathons, attending conferences). I will be "paying it forward" by following the example of those who have helped me with my career and mentoring junior colleagues. I will have grown fluent in at least one new programming language. I will have started my own open source project in addition to making contributions to others. </p>
<p>In ten years, I see myself in a leadership position of some sort at my place of employment. I will have a robust professional network. I will still be acquiring new skills and mastering new methods as the landscape of the data science field grows and changes.</p>
<p>Right now, I've been focused on putting together portfolio projects to show potential employers. The above goals seem rather remote given that I am in the trenches of job hunting. But keeping the big picture in mind should help motivate what I do now. </p>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>What counts as interesting data? That could be a topic for a future blog post. <a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>