California leads the way with roughly 8k EHR Meaningful Use Installs


I stumbled upon this commencement speech on Friday.  It got me thinking about higher education in the US, and in particular, the core curriculum at many colleges and universities.  I think there is great value in a liberal arts education, but I have always thought the core curriculum could be adjusted a bit to guarantee some basic skills are acquired along the way.  Here is what the core curriculum looks like for the College of Arts & Sciences at my alma mater: Core Curriculum – Boston College

I like the wide spectrum of material covered and the variety of ways that it exercises the brain, but among the 15 courses needed to complete the core curriculum, I would require a computer science course (per the aforementioned article), a basic finance course, and maybe an accounting course.

Maybe there are others that should be a requirement as well?


Major joint replacements cost Medicare more than $6B in 2011

Yesterday there was quite a bit of buzz about the release of Medicare’s inpatient payments for the top 100 diagnosis related groups.  The Washington Post published some highlights on the data including a neat widget to visualize the data.  We decided to take our own shot at it.  This was a fun dataset for us since we were able to leverage a ton of work that we’ve already done.  We had already standardized entities for hospital, organization, DRG, and city from other CMS datasets.  I downloaded the data at 3 PM and had it up and running on karmadata by 5.  We added a couple of calculated measures for total amount paid by Medicare and discrepancy between amount charged and amount received, and started making datacards.
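For a flavor of those two calculated measures, here is a minimal sketch in Python. This is not our actual pipeline code; the column layout and the figures are invented for the example, not real CMS numbers:

```python
# Hypothetical rows from the CMS inpatient payment file:
# (DRG, discharges, avg covered charges, avg Medicare payment)
records = [
    ("470 - MAJOR JOINT REPLACEMENT", 100, 50000.0, 14000.0),
    ("470 - MAJOR JOINT REPLACEMENT", 50, 70000.0, 15000.0),
]

# Measure 1: total amount paid by Medicare.
total_paid = sum(n * paid for _, n, _, paid in records)

# Measure 2: discrepancy between amount charged and amount received.
total_charged = sum(n * charged for _, n, charged, _ in records)
discrepancy = total_charged - total_paid

print(total_paid, discrepancy)  # 2150000.0 6350000.0
```

The real dataset reports per-hospital averages, so weighting by discharge count, as above, is the natural way to roll the measures up to a DRG or national total.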

Here’s what folks are saying:

EMA drug approvals

We recently added the European public assessment reports (EPAR) data set.  Here is a ranking of non-generic EMA product approvals by company since 2009.

Orphan Drug Designations

Orphan drugs have been a hot topic of late.  We created this card which shows a steady increase in the number of companies receiving orphan drug designations from the FDA.


This datacard caught our eye since we’ve seen recent news reports, including from Forbes, on the increase in requests from foreign governments to Google and Microsoft for user data. Events such as Monday’s Boston bombings likely cause a spike in such activity (and for good reason).


People are inherently social.  Facebook, Twitter, Pinterest, LinkedIn, Spotify, FourSquare.  The social media list goes on.  And while many people avoid social media for reasons ranging from privacy concerns (more on this later) to not wanting to know what everyone is doing 24 hours a day (everyone has a friend or two who are social media spammers), the overall popularity of these sites indicates an almost unquenchable thirst for socializing, sharing, collaborating, and interacting.

What Facebook is to friends and photos, LinkedIn is to colleagues and work connections, and Spotify is to music fans and music, karmadata is to data consumers and data visualizations.  So even while we are heads down, programming, and buried in code, there is always an overriding sentiment in the back of our minds: we are building something social, collaborative, and most importantly, fun.

Another form of online interaction is blogging (like this one), and making data visualizations into mini-blog posts is the inspiration behind where we are heading with our datacard design.  The idea is that each datacard tests or validates a theory, and the user can then publish their insights on karmadata.  We are trying to make datacard creation as personalized, interactive, and fun as possible.  That means creating custom titles, descriptions, x and y-axis labels, and anything else that our user community can come up with.  We do our best to provide users with the basics, but our vision is that our users will take the value of the datacards to another level.  That means that an auto-generated y-axis of “# of Clinical Trials” can be quickly altered to “# of Phase III Leukemia Trials”.  Editing filters, seeing how it affects the data, customizing the metadata.  All of this should be fun.

That’s the fun part.  The social part is sharing that mini-blog with your friends and colleagues, engaging in comments back and forth, and leveraging the expertise of each other to answer questions and solve problems.  Or it can be finding a datacard that someone else has already created that answers the same question that you have.  This sharing and collaboration is the first half of our namesake.  The idea is that you get out of the community what you put into it.  Sharing is good karma.

Now, much like the person who wants to “stay off the grid”, we recognize that many data consumers will not want to share because they do not want others seeing what they are interested in.  Many pharma companies, in particular, are paranoid about competitors knowing what they are up to, and in many cases these concerns are valid.  But since we want everyone to share, our philosophy is that if you don’t want to share, then you have to pay to avoid sharing.  In such a case, a company can pay for karmadata Plus to unlock functionality that lets everyone at the company remain anonymous and closed off from the rest of the community (Plus users also receive other benefits, like data download and upload of internal data, but that’s a story for another day).

In any case, we ask you to share your ideas with us about how to make the site more fun and social (because that’s good karma).

Trend of Novartis Cancer Trials

This week’s featured datacard is from our friend (and thoughtful beta user) Moe Alsumidaie from Annex Clinical.  The datacard trends Novartis’ cancer (ICD-9 140-239) trials over time by trial start date.  Trials have steadily increased over time, and 10 are already started or planned for 2013.  (As an aside, Moe runs a thoughtful group on LinkedIn for those interested in the clinical space: Breakthrough Solutions in Clinical Trials and Healthcare.)

From time to time we’ll highlight a data set on karmadata.  Today I’ll provide a quick look at the NIH RePORTER grants database.

The RePORTER database (which replaced the old CRISP database) “provides access to reports, data, and analyses of NIH research activities, including information on NIH expenditures and the results of NIH supported research.”  In other words, we get to see our tax dollars at work.

When looking at these data sets I’ll try to highlight what is great about the source data/website (I can’t just be complaining all the time), and then highlight the value that we’re able to add.

The data itself (provided in both csv and XML) contains the funding agency (NIH, NCI, etc.), the organization receiving the grant, the location, the principal investigators running the study, a list of terms associated with the project, and the amount funded for the project.  The RePORTER website has some pretty nice functionality for aggregating and ranking by those different entities.  You can play around with that tool here.  You can even map the data and drill down to view grants awarded to different states.  Neat.  The greatest limitation is probably the fact that you can only analyze the data one fiscal year at a time, but overall it’s a pretty nice presentation of the data.

The first thing I look for when I get my hands on a new dataset is the potential entities that we can standardize to.  This was a fun dataset for me because of all the entities that can be teased out.  In addition to the aforementioned entities, we were able to match the terms list to drugs and diseases.  The RePORTER database also provides an ID for the principal investigators, but unfortunately, much like the reviewer ID from BMIS, it is not unique.  We consolidate those entries.  We consolidate different company names to resolve to a unique ID, and then we are ready to go: city, state, country, organization, principal investigator, drug, disease, and time.  A robust database for both building our entity profiles and creating cool visualizations.
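Our actual consolidation logic is more involved than this, but a toy sketch of the idea, collapsing variant organization spellings onto a single entity ID, might look like the following. The normalization rules and the names are invented for the example:

```python
import re

# Common corporate suffixes to ignore when matching names (illustrative list).
SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}

def canonical_key(name: str) -> str:
    """Lowercase, strip punctuation, and drop corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

entity_ids: dict[str, int] = {}

def resolve(name: str) -> int:
    """Map a raw name to a stable integer entity ID."""
    key = canonical_key(name)
    if key not in entity_ids:
        entity_ids[key] = len(entity_ids) + 1
    return entity_ids[key]

# Variant spellings resolve to the same entity.
assert resolve("Johns Hopkins University") == resolve("JOHNS HOPKINS UNIVERSITY.")
assert resolve("Acme Pharma, Inc.") == resolve("ACME PHARMA")
```

In practice the hard part is building and curating the rule set (and handling genuinely ambiguous names by hand), but the resolve-to-one-ID shape is the same.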

Leading Organizations Receiving NIH Grant Funding

Johns Hopkins leads organizations receiving NIH grant funding


Some facts we have gleaned from the database:

  • Johns Hopkins leads the way in NIH funding since FY2000 (with more than $7.5 billion)
  • NIH funding increased steadily from 2000 until peaking in 2010 at $38 billion
  • Boston leads the way in funding over that time (score one for Boston in the Boston-New York rivalry)
  • NIH funding was not limited to the United States: $5.4 billion has gone outside the US since 2000, with South Africa leading the way
NIH grant funding trend

NIH grant funding peaked in FY2010

That should give you a flavor for what you can do with the dataset.  Try copying one of my datacards and discovering your own insights.

One of our key missions is to curate open data sources and provide the data back to the world so that the brilliant thought leaders in the industry can use data more effectively and efficiently.  This means taking the data in its native format (typically XML, txt, csv, Excel), loading it into our Oracle relational database, standardizing the data to important entities like person, place, organization, drug, and disease, and then providing that data for download in a standard text file format.

Since we want other companies to be downloading and using our data, I figured I should go through the process myself and see how long it would take me to download and load some karmadata.  I chose one of my favorite datasets, published by the National Institutes of Health, because it is published in XML with a fairly complex schema, and there are plenty of free text fields that make standardization ultra-difficult and important.

We’ve attempted to make getting off the ground with karmadata as quick and easy as possible.  Our Toolkit contains all the metadata that you should need, as well as the SQL scripts to load the data (currently complete for Oracle, but will be completed for SQL Server, MySQL, etc. in the near future).  The hope was that someone could download the data, load it into a relational database, and answer a hard to answer question in less than a day.
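The Toolkit scripts themselves target Oracle, but the shape of the load step can be sketched with SQLite standing in for the database. The table, columns, and rows below are invented for the illustration, not the actual Toolkit schema:

```python
import csv
import io
import sqlite3

# A stand-in for one of the downloaded karmadata text files.
raw = io.StringIO(
    "trial_id,sponsor,start_date\n"
    "NCT001,Acme,2011-03-01\n"
    "NCT002,Beta,2012-07-15\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial_fact (trial_id TEXT, sponsor TEXT, start_date TEXT)"
)

# Skip the header row, then bulk-insert the records.
rows = list(csv.reader(raw))[1:]
conn.executemany("INSERT INTO trial_fact VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM trial_fact").fetchone()[0]
print(count)  # 2
```

With Oracle, the external table scripts in the Toolkit let the database read the text files directly, which replaces the parse-and-insert loop above.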

I began by cruising the available files on the download page.  Knowing which dataset I wanted to load, I clicked into the Source Files section to view the raw source data, and then into the Fact Files section to check out the standardized records that accompany it.  I downloaded all of the available files.

Next I downloaded the files provided in the Toolkit section and read the readme.doc, which took me through the process.  I found it to be extremely well written.  Whoever authored it must be incredibly brilliant and good looking.  I identified the scripts for creating the tables for fact and source data, as well as the external table scripts to load the data into those tables.

Then I got started.  I created the tables for loading, unzipped the first period of data, and ran the inserts to load the data.  Rather than programmatically unzipping and loading the data, I simply manually unzipped and ran the inserts as I went.

Ten minutes after I read the readme document, I had the entire dataset loaded into a relational database, and best of all it was standardized to entities for sponsor organization, clinical sites, clinical investigators, geography, disease, drug, and time.

Now the fun part.  The last thing we provide in the toolkit is a couple of queries to get you started playing around with the data.  In this case we ask: which sites lead in running industry-sponsored neurodegenerative disease trials from 2009 to 2012?  I run the query, and boom, I’m looking at a list that looks like a less attractive version of this data visualization.
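The real starter query runs against our standardized tables; a toy version against SQLite, with an invented schema and made-up rows, looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A miniature stand-in for the standardized trial/site fact table.
conn.executescript("""
CREATE TABLE trial_site_fact (
    site TEXT, disease TEXT, sponsor_type TEXT, start_year INT
);
INSERT INTO trial_site_fact VALUES
  ('Site A', 'Neurodegenerative Diseases', 'Industry', 2010),
  ('Site A', 'Neurodegenerative Diseases', 'Industry', 2011),
  ('Site B', 'Neurodegenerative Diseases', 'Industry', 2009),
  ('Site B', 'Leukemia', 'Industry', 2010);
""")

# Leading sites for industry-sponsored neurodegenerative trials, 2009-2012.
rows = conn.execute("""
    SELECT site, COUNT(*) AS n
    FROM trial_site_fact
    WHERE disease = 'Neurodegenerative Diseases'
      AND sponsor_type = 'Industry'
      AND start_year BETWEEN 2009 AND 2012
    GROUP BY site
    ORDER BY n DESC
""").fetchall()

print(rows)  # [('Site A', 2), ('Site B', 1)]
```

Because the sites, diseases, and dates are already standardized, the whole question reduces to a filter, a group-by, and a sort.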


You could download the data from karmadata, or you could just create this data visualization on karmadata

Now, just to recap what it would take to do that from scratch: you would need to download the entire dataset in XML, load the XML into a relational database, standardize the start dates to real dates, standardize the many versions of each site name to a standard identifier, group together all of the MeSH terms that fall under neurodegenerative diseases, and then run a query similar to the one we provided.
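Two of those from-scratch steps, pulling fields out of the source XML and turning a free-text start date into a real date, can be sketched as follows. The XML layout here is invented; real schemas vary, and real site names need the full consolidation treatment:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# A made-up fragment standing in for one study record from the source XML.
doc = ET.fromstring(
    "<study>"
    "<site>St. Elsewhere Hosp.</site>"
    "<start_date>March 2011</start_date>"
    "</study>"
)

site = doc.findtext("site")
raw_date = doc.findtext("start_date")

# Standardize the free-text "Month Year" value to a proper date.
start = datetime.strptime(raw_date, "%B %Y").date()
print(site, start)  # St. Elsewhere Hosp. 2011-03-01
```

Multiply that by every date format, every site-name variant, and every term vocabulary in the dataset, and the countless hours of development start to add up.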

These are enormous barriers to entry for a functional, effective way of using the data.  But what took us countless hours of development can take you about 10 minutes.  (Or you could just find or create a datacard on karmadata in about 10 seconds, but you get the point.)

Using our service was a little surreal for me.  I was downloading data that I had downloaded, loaded, and standardized, and then was loading it back into an Oracle database.  But it left me wishing that I could just use something like karmadata instead of dealing with all the pains that come with unstandardized data sets.  Hopefully it will make you feel the same way.