Hyunwoo Park

Sep 19, 2014

I recently worked on a project that required getting latitude and longitude from unstructured addresses. My advisor found me a website that runs this conversion in batch, and it turned out to be quite decent.

http://www.findlatitudeandlongitude.com/batch-geocode/

Judging from a quick look at its JavaScript, the site seems to use Google's API. So, if you put in any text related to a location, it will do its best to convert it into a normalized address and latitude-longitude coordinates. Here's an example.

"original address","returned address",latitude,longitude,accuracy,status code
"georgia tech","Georgia Institute of Technology, North Ave NW, Atlanta, GA 30332, USA",33.775618,-84.396285,3,200
"seoul","Seoul, South Korea",37.566535,126.977969,3,200
"white house","The White House, 1600 Pennsylvania Avenue Northwest, Washington, DC 20500, USA",38.897676,-77.03653,3,200
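If you want to use the batch output in your own analysis, a minimal Python sketch like the one below can read it. It assumes the output is saved as CSV text with the header row shown above and that a status code of 200 marks a successful lookup; the function name `parse_geocodes` is just for illustration.

```python
import csv
import io

# Sample rows copied from the batch output above.
SAMPLE = '''"original address","returned address",latitude,longitude,accuracy,status code
"georgia tech","Georgia Institute of Technology, North Ave NW, Atlanta, GA 30332, USA",33.775618,-84.396285,3,200
"seoul","Seoul, South Korea",37.566535,126.977969,3,200
'''

def parse_geocodes(text):
    """Return {original address: (latitude, longitude)} for successful rows."""
    coords = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: status code 200 means the lookup succeeded.
        if row["status code"] == "200":
            coords[row["original address"]] = (
                float(row["latitude"]),
                float(row["longitude"]),
            )
    return coords

coords = parse_geocodes(SAMPLE)
print(coords["seoul"])
```

The quoted fields matter here: the returned address contains commas, so a plain `split(",")` would break the row, while the `csv` module handles it.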

It seems to be a pretty useful research tool for me, as I don't have to write code myself to call Google's geocoding API.

Sep 19, 2014

Since graduating from ischool, I have not followed trends in the tech startup space much for years, busy taking classes and doing research, all good stuff. Many things seem to have happened: crunchbase started and some new languages and frameworks (node.js, golang) have appeared.

Recently, I signed up to receive the daily newsletter from crunchbase. It sends me a digest of startups that got funded. Skimming through them one by one is becoming my daily habit now. You can browse the archive of their newsletter here: http://static.crunchbase.com/daily/content_20140919_web.html

Then, a few days ago I found a new service called producthunt. It's basically a daily tournament of new products and services.

One thing I recently found was drop. You can take a look yourself.

Obviously cool things are coming to the market every day, and it's quite amazing to watch them in almost real time.

Sep 10, 2014

Most apps on the cloud are sleeping, because the platform provider idles an app if it receives no request for a certain amount of time. Heroku's threshold is 1 hour. I found several discussions on the web:

I first tried adding new relic to my heroku app, but it seemed like overkill and was a bit involved to get working.

Then, I found this "Kaffeine" and it worked for me. One limitation is that it's only for heroku apps. http://kaffeine.herokuapp.com/
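The idea behind such services is simply to request the app more often than the idle threshold. Here is a minimal Python sketch of that idea, not how Kaffeine itself is implemented. The URL is a hypothetical placeholder, and the 30-minute interval is just one choice that stays under the 1-hour threshold mentioned above.

```python
import time
import urllib.request

# Hypothetical app URL; replace it with your own.
APP_URL = "http://example.herokuapp.com/"

def ping_once(url, opener=urllib.request.urlopen):
    """Request the URL once; return the HTTP status code, or None on failure."""
    try:
        with opener(url, timeout=10) as response:
            return response.status
    except OSError:  # URLError and socket timeouts are subclasses of OSError
        return None

def keep_awake(url, interval_seconds=30 * 60, rounds=1, ping=ping_once, sleep=time.sleep):
    """Ping `url` `rounds` times, sleeping `interval_seconds` between pings."""
    statuses = []
    for i in range(rounds):
        statuses.append(ping(url))
        if i < rounds - 1:
            sleep(interval_seconds)
    return statuses

# Example (makes real requests, so it is left commented out):
# keep_awake(APP_URL, rounds=48)  # roughly one day of pings
```

The script has to run somewhere that does not sleep itself, such as a cron job on a machine you control.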

For other cloud providers, here are a few alternatives I found.

Sep 8, 2014

I have used colors from the Google Visualization default palette for several projects. You can quickly generate all the colors using the simple code below, then pick the ones you want and save them for yourself.

http://jsfiddle.net/qfymrmnd/

If you don't have a way to extract the RGB codes, here they are. The order is blue, red, orange, green, purple, and so on. Although I am giving you about 30 colors here, I personally try not to go beyond 5-6 colors at most. Beyond that threshold, the colors in this list start to look repetitive.

#3366CC
#DC3912
#FF9900
#109618
#990099
#0099C6
#DD4477
#66A900
#B52D2D
#2F5F8F
#994499
#21A797
#AAAA11
#6633CC
#E67300
#8B0707
#651067
#329262
#5574A6
#3B3EAC
#B77322
#16D620
#B91383
#F4359E
#9C5935
#A9C413
#2A778D
#668D1C
#BEA413
#0C5922
#743411
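Some tools want R, G, B numbers rather than hex codes. A minimal Python sketch for converting the codes above is shown below; the helper name `hex_to_rgb` is my own.

```python
def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' into an (R, G, B) tuple of integers in 0-255."""
    digits = hex_code.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

# First five colors of the palette listed above.
palette = ["#3366CC", "#DC3912", "#FF9900", "#109618", "#990099"]
for code in palette:
    print(code, hex_to_rgb(code))
```
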

Sep 1, 2014

Introduction

Circos is a visualization tool that draws a network in a circular layout. In my experience and to the best of my knowledge, it is the richest medium in which a network can be shown and data can be visually encoded.

What this post is NOT about

Circos is not the most user-friendly visualization tool on earth. If you are looking for installation help, you've got the wrong number. Here are a few recommended readings that I referred to when I had problems installing circos.

Making sure all parts of circos are downloaded and properly loaded is just painful. I guess more than half of the people who are fascinated by a circos graphic give up trying to install it on their machine. It was just that hard for me.

Once you successfully install circos and are able to produce a graphic by following the official tutorials, you will be amazed by how comprehensive they are. However, the problem with such a comprehensive set of tutorials is that you cannot easily find a way to convert your traditional network viz into one of the cool circos viz---both conceptually and technically.

This post is intended for those who 1) have installed circos, 2) have produced some graphics following its tutorial, and 3) now want to plug their own data into the circos format. It's my attempt to document how I understand the transformation of a usual network visualization into a circos visualization.

Anatomy of a circos visualization

Before getting started, I'd like to emphasize that circos has its own naming conventions for the parts of a visualization. This made it hard for me to understand what the tutorials meant at the beginning. So, first off, I recommend you skim through the following nice summary slides on the anatomy of circos graphics.

http://jura.wi.mit.edu/bio/education/hot_topics/Circos/Circos.pdf

From the figure above, remember four elements: (B) ideogram, (H) ticks, (F) highlights, and (E) links. Ideograms are the circular arc segments, drawn with some thickness, around the big circle. Ticks show the units of the viz. Highlights emphasize a certain part of an arc. Lastly, links are connections between arcs.

How circos is conceptually different from the usual network visualization

If I were to create a usual network visualization of a two-node graph, it would look like this.

Circos can visualize this kind of relationship for sure, but it can also handle much more complex relationships. For example, suppose the two-node graph we saw above is now a multi-graph, i.e., a pair of nodes can have more than one edge. The figure below shows this network. Nodes 1 and 2 now have three edges between them with varying weights.

If you share some sense of aesthetics with me, you will find it ugly---and, more importantly, arbitrary---and feel there must be a better way to deal with this sort of situation. Circos is that way.

Understanding circos data format

The original purpose of circos was to visualize relationships among regions of chromosomes in genomic data. Look at some of these wiki pages to see if they help.

You may think its origin doesn't really matter as long as it solves your problem. But the problem is that the circos documentation explains things using biology jargon---karyotype, chromosome---which I think hinders understanding for a general audience.

After some hours of struggling, I devised my own way of interpreting the biological concepts built into circos. First of all, a chromosome is a node. So, you need to prepare a file that contains the list of all nodes. Suppose nodes 1 and 2 are US and China, respectively, and you are trying to visualize some trades between them. The first thing you need is something like this.

nodes.txt

chr - usa USA 0 2000 myblue
chr - chn CHINA 0 1000 myred

Let me explain one by one.

  • Two lines: We will have two nodes in our viz.
  • Every line starts with "chr - ": It's just a convention denoting that this line describes a node (i.e., a chromosome).
  • "usa" / "chn": node id
  • "USA" / "CHINA": node labels
  • 0 to 2000 / 0 to 1000: node size (i.e., start and end position). The USA node is of size 2000 and the CHINA node is of size 1000. Note that circos only accepts integers for positions.
  • The last element of each line denotes node color. I will explain how to define your own color in the next section. Here we focus on setting up data files in the right format.
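If your nodes already live in a script, you can generate these lines instead of typing them. The following is a minimal Python sketch under the format described above; the helper name `karyotype_line` is my own, and it assumes every node starts at position 0.

```python
def karyotype_line(node_id, label, size, color):
    """Format one node as a line of nodes.txt (the circos karyotype file)."""
    # circos only accepts integer positions, so reject anything else early.
    if not isinstance(size, int) or size <= 0:
        raise ValueError(f"node size must be a positive integer, got {size!r}")
    return f"chr - {node_id} {label} 0 {size} {color}"

nodes = [
    ("usa", "USA", 2000, "myblue"),
    ("chn", "CHINA", 1000, "myred"),
]
lines = [karyotype_line(*node) for node in nodes]
print("\n".join(lines))
```

Writing `"\n".join(lines)` to data/usachn/nodes.txt would reproduce the file shown above.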

Now that you have prepared the list of nodes, let us get to the list of edges. Edges are called "links" in circos. Recall the example of the two-node multi-graph above. Suppose we are trying to implement three edges between USA and CHINA.

edges.txt

usa 200 500 chn 100 250 color=myblue_transparent
usa 700 900 chn 500 600 color=myred_transparent
usa 1200 1300 chn 800 850 color=myblue_transparent

Each line is formatted as "node1_id node1_start node1_end node2_id node2_start node2_end color=mycolor". Note that a pair of nodes can have multiple edges and each edge occupies a different part of each node. This will be more evident in the final graphic.
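Because each edge occupies a span on each node, it is easy to write a span that runs past the end of a node. A minimal Python sketch for checking edges.txt against the node sizes follows. The function names are my own, and the check assumes start is not larger than end, which is a simplification I chose for this example rather than a rule taken from the circos documentation.

```python
# Node sizes taken from nodes.txt above.
NODE_SIZES = {"usa": 2000, "chn": 1000}

def parse_link(line):
    """Split one line of edges.txt into two (node_id, start, end) tuples."""
    id1, start1, end1, id2, start2, end2 = line.split()[:6]
    return (id1, int(start1), int(end1)), (id2, int(start2), int(end2))

def link_fits(line, node_sizes):
    """Return True if both ends of the link lie within their nodes."""
    for node_id, start, end in parse_link(line):
        if node_id not in node_sizes:
            return False
        if not (0 <= start <= end <= node_sizes[node_id]):
            return False
    return True

edges = [
    "usa 200 500 chn 100 250 color=myblue_transparent",
    "usa 700 900 chn 500 600 color=myred_transparent",
    "usa 1200 1300 chn 800 850 color=myblue_transparent",
]
print(all(link_fits(edge, NODE_SIZES) for edge in edges))
```
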

We have prepared all the basic elements so far. These two files are just the bare bones. However, circos provides many more charting features, which I cannot go over in this post. Let me show you how to play around with one of them here. Suppose you want to highlight some parts of the nodes with a different color. Then, you prepare the following file in addition to nodes.txt and edges.txt.

highlights.txt

usa 100 700 fill_color=myred
usa 700 1300 fill_color=myblue
usa 1300 1900 fill_color=myred
chn 100 500 fill_color=myblue
chn 500 900 fill_color=myred

Each line gives a node id, a start position, an end position, and a fill color. The structure of this highlight file will become self-evident when we see the output visualization.

Putting it all together into a visualization

At the heart of every circos visualization is the configuration file. A config file contains a list of commands (or directives) you want the viz engine to perform. Your custom definitions of colors, fonts, and placement all go into the config file. Now that you have all three data files ready---nodes.txt, edges.txt, and highlights.txt---you just need to invoke these files using the right language in the config file. Let's say your main configuration file is named "usachn.conf" in the "etc/" folder and the data files reside in the "data/usachn/" folder.

First, read in your nodes with this line.

karyotype = data/usachn/nodes.txt

Your edges are read in with this block.

<links>
<link>
file = data/usachn/edges.txt
ribbon = yes
flat = yes
radius = dims(ideogram,radius_inner)-30
bezier_radius = 0r
</link>
</links>

"ribbon" and "flat" should be set to "yes" in order to make circos render edges as defined in our data file, edges.txt. "radius" determines where links start. In this case, edges are drawn 30 pixels inside the inner circle of the ideogram. (Ideogram means the circular ring of nodes.) "bezier_radius" determines the curvature of the edges.

Highlights are called in as follows.

<plots>
<plot>
type = highlight
file = data/usachn/highlights.txt
r0   = dims(ideogram,radius_inner)-5-15
r1   = dims(ideogram,radius_inner)-5
stroke_color = dgrey
stroke_thickness = 0p
</plot>
</plots>

Output file destination is put in as follows.

<image>
<<include etc/image.conf>>
file* = circos-usachn.png
</image>

Lastly, your custom colors ("myblue" and "myred") are defined in the configuration file as follows.

<colors>
myblue = 0, 0, 255
myred = 255, 0, 0

myblue_transparent = 0, 0, 255, .5
myred_transparent = 255, 0, 0, .5
</colors>

For each line of custom color definition, the first three elements are R, G, and B, and the optional fourth field is alpha (transparency). 0 is fully opaque and 1 is fully transparent.

The full configuration file can be viewed and downloaded here. You run circos using this config file in the command line as follows.

circos -conf etc/usachn.conf

And finally the resulting visualization will look like this.

It may not be as fancy as what you saw on the internet, but you probably have a better idea by now of how circos interprets your commands and data. You can play around with some of the parameters in the config file to see which setting leads to which output feature.

Conclusion

Circos is just an impressively rich medium for visualization. It provides tons of other visualization elements such as histograms or 2D plots. Admittedly, I don't know everything circos offers. But, when I struggled with the conceptual aspect of circos, I couldn't find a simple to-go example on the web. All the documentation and even the tutorials seemed very arcane to me. (Now I probably understand them better.) So, I decided to write one. Hope it helps!

Aug 23, 2014

I have spent some time over the past month playing games. One is Caravaneer 2, which I had been waiting for for years after enjoying its prequel, Caravaneer. Another game I spent some time on was Capitalism 2, which I enjoyed playing many years ago. It was still good this time.

Now, I am shifting my focus back to my main work---research. I am on a few visualization tasks and dabbling with Circos. Circos seems to be one of the best visualization tools you can use to create fine-tuned and nuanced infographics in a circular layout. Installing and working with it is a little bit obscure, but I think I am figuring it out. I plan to write my own version of a tutorial for others here.

May 16, 2014

I came across a blog that can be a model for what I am envisioning for my small blog. It's called BPS Research Digest. Although it is an authoritative blog published by the British Psychological Society and my blog is just by me, I would like to use this space to review papers I find interesting in the manner that BPS presents them.

The post that led me to the blog is this one. I quickly analyzed the structure of the writing to have a guideline for myself. http://bps-research-digest.blogspot.com/2014/04/a-self-fulfilling-fallacy.html

From reading this post, a general structure I could write is as follows.

  1. Introduce with a commonly known story that could potentially be the motivation of the research. In this case, the author mentions a common human error in judgment called the Gambler's Fallacy. (about 150 words)
  2. Explain what the authors actually did in the paper. Experiment? Modeling? Secondary data analysis? Highlight the main contribution only even if the authors have done many things for the whole paper. (about 100 words)
  3. Insert a picture in the middle if it can be helpful. (optional; Search Creative Commons images from flickr here)
  4. Summarize the results. (about 200 words)
  5. Summarize the authors' interpretation of the results. (about 150 words)
  6. Conclude by going back to the theme used in the introduction---the gambler's fallacy in this case. (about 100 words)

If I can write this way, this alone will amount to 600-700 words. I hope it will be a good exercise for me to learn how to shape a good research question.

May 13, 2014

Today I became a Ph.D. candidate! I presented the dissertation proposal last Friday and today. I have spent about a month on this proposal. Scheduling was one of the toughest things to do, and that's why I proposed twice. (One should not do it twice.)

Overall, this official milestone of my Ph.D. study has prepared me to better frame and position the work I have been doing. In fact, I realized that a dissertation might be slightly different from a research paper to be published in an academic journal. The committee looks for a theory that is as generalizable as possible and as applicable as possible to multiple contexts, while a research paper might want to be very specific in admitting the limitations of the work. This actually made me think about how I can and should generalize the findings or hypotheses to other contexts such as automobiles, shipbuilding, etc. This is quite a different task from what I have been doing. It will be challenging, but I find it a must-have for a dissertation.

In addition to the proposal, I gave a few presentations at POMS over the last weekend. In total, I gave four presentations over four days. It was very exhausting, but also an intensive learning period. The most important thing I think I learned from this presentation spree is the importance of a storyline. Having a good narrative always helps. When I do research, I am often so focused that I lose the big picture of what I am doing and trying to say. In that sense, creating a powerpoint deck for the paper I want to present helps me stay on a consistent storyline and understand the key contributions of my own work.

During the last week, I missed the 500-word quota. As I restart my routine workload, I need to get back to the quota again. This blog post has 308 words.

May 11, 2014

This is my first time attending the POMS conference. Fortunately, this year it is held in Atlanta (a 5 min drive from the school). I will give two presentations today at 1 pm and 4 pm in A707.

May 6, 2014

I studied mediation vs. moderation today.

Suppose A (cause) leads to B (effect). A moderator is something that affects the relationship between A and B. A mediator is something that explains why A leads to B. I summarized it into the following figure.

If the mediating term is included, the main effect from A to B should be reduced, because some part of the effect is explained by the mediation path. A moderator, on the other hand, is detected through its interaction term with A. The moderator itself can also have a direct effect on B.
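To make this concrete, here is how the two ideas are usually written as linear regression models. This is a sketch in my own notation, not taken from any particular reference: M stands for the mediator, Z for the moderator, the i and beta terms are intercepts and coefficients, and the e terms are errors.

```latex
% Mediation: A -> M -> B
\begin{align*}
B &= i_1 + c\,A + e_1            && \text{total effect of } A \text{ on } B \\
M &= i_2 + a\,A + e_2            && \text{effect of } A \text{ on the mediator} \\
B &= i_3 + c'\,A + b\,M + e_3    && \text{direct effect } c' \text{ after including } M \\
c &= c' + a\,b                   && \text{total = direct + indirect}
\end{align*}

% Moderation: Z changes the strength of the A -> B relationship
\begin{align*}
B &= \beta_0 + \beta_1 A + \beta_2 Z + \beta_3 (A \times Z) + e
\end{align*}
```

In the mediation model, the drop from c to c' is the part of the effect carried by the path through M. In the moderation model, a nonzero beta_3 signals moderation, while beta_2 is the moderator's own direct effect on B.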
