Orphan drugs have been a hot topic of late. We created this card, which shows a steady increase in the number of companies receiving orphan drug designations from the FDA.
This datacard caught our eye because we’ve seen recent news reports, including one from Forbes, on the growing number of requests that foreign governments make to Google and Microsoft for user data. Events such as Monday’s Boston bombings likely cause a spike in such activity (and for good reason).
This week’s featured datacard is from our friend (and thoughtful beta user) Moe Alsumidaie from Annex Clinical. The datacard trends Novartis’ cancer (ICD-9 140-239) trials over time by trial start date. The number of trials has steadily increased over time, with 10 already started or planned for 2013. (As an aside, Moe runs a thoughtful group on LinkedIn for those interested in the clinical space: Breakthrough Solutions in Clinical Trials and Healthcare.)
From time to time we’ll highlight a data set on karmadata. Today I’ll provide a quick look at the NIH RePORTER grants database.
The RePORTER database (which replaced the old CRISP database) “provides access to reports, data, and analyses of NIH research activities, including information on NIH expenditures and the results of NIH supported research.” In other words, we get to see our tax dollars at work.
When looking at these data sets I’ll try to highlight what is great about the source data/website (I can’t just be complaining all the time), and then highlight the value that we’re able to add.
The data itself (provided in both csv and XML) contains the funding agency (NIH, NCI, etc), the organization receiving the grant, the location, the principal investigators running the study, a list of terms associated with the project, and the amount funded for the project. The RePORTER website has some pretty nice functionality for aggregating and ranking by those different entities. You can play around with that tool here. You can even map the data and drill down to view grants awarded to different states. Neat. The greatest limitation is probably the fact that you can only analyze the data one fiscal year at a time, but overall it’s a pretty nice presentation of the data.
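The one-fiscal-year-at-a-time limitation mostly goes away once you download the yearly csv files and stack them yourself. Here is a minimal Python sketch of that idea. The column names (FY, ORG_NAME, TOTAL_COST) and the dollar figures are placeholders made up for illustration, not the real RePORTER layout or real funding amounts, so check them against the actual export before relying on this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for two yearly csv exports.
# Column names and amounts are assumptions, not real RePORTER data.
FY2009 = """FY,ORG_NAME,TOTAL_COST
2009,Johns Hopkins University,1000
2009,University of Washington,800
"""
FY2010 = """FY,ORG_NAME,TOTAL_COST
2010,Johns Hopkins University,1200
2010,University of Washington,700
"""


def total_by_org(csv_texts):
    """Sum funding per organization across any number of yearly files."""
    totals = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            totals[row["ORG_NAME"]] += int(row["TOTAL_COST"])
    # Rank organizations from most to least funded.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = total_by_org([FY2009, FY2010])
for org, amount in ranking:
    print(org, amount)
```

With real files you would pass the contents of each downloaded year instead of the inline strings; the grouping logic stays the same.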
The first thing I look for when I get my hands on a new dataset is the potential entities that we can standardize to. This was a fun dataset for me because of all the entities that can be teased out. In addition to the aforementioned entities, we were able to match the terms list to drugs and diseases. The RePORTER database also provides an ID for the principal investigators, but unfortunately, much like the reviewer ID from BMIS, it is not unique. We consolidate those entries. We consolidate different company names to resolve to a unique ID, and then we are ready to go: city, state, country, organization, principal investigator, drug, disease, and time. A robust database for both building our entity profiles and creating cool visualizations.
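To make the "consolidate different company names to a unique ID" step a little more concrete, here is a rough Python sketch. It is only an illustration of the idea, not a description of how karmadata actually does it: the suffix list, the `name_key` and `assign_ids` helpers, and the sample names are all assumptions, and real matching needs far more care (abbreviations, mergers, misspellings).

```python
import re

# Hypothetical list of legal suffixes to ignore; a real list would be longer.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}


def name_key(raw_name):
    """Reduce a raw organization name to a crude matching key."""
    # Lowercase, replace punctuation with spaces, then split into words.
    words = re.sub(r"[^a-z0-9 ]", " ", raw_name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def assign_ids(raw_names):
    """Give every name variant that shares a key the same integer ID."""
    ids_by_key = {}
    result = {}
    for name in raw_names:
        key = name_key(name)
        # A new key gets the next ID; a known key keeps its existing one.
        ids_by_key.setdefault(key, len(ids_by_key) + 1)
        result[name] = ids_by_key[key]
    return result


variants = ["Pfizer Inc.", "PFIZER, INC", "Pfizer", "Gilead Sciences, Inc."]
org_ids = assign_ids(variants)
for name, org_id in org_ids.items():
    print(org_id, name)
```

Once every variant carries the same ID, grouping and ranking by organization becomes a plain aggregate instead of a string-matching exercise.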
Some facts we have gleaned from the database:
- Johns Hopkins leads the way in NIH funding since FY2000 (with more than $7.5 billion)
- NIH funding increased steadily from 2000 until peaking in 2010 at $38 billion
- Boston leads the way in funding over that time (score one for Boston in the Boston-New York rivalry)
- NIH funding was not limited to the United States. $5.4 billion has been awarded outside the US since 2000, with South Africa leading the way
That should give you a flavor for what you can do with the dataset. Try copying one of my datacards and discovering your own insights.
Each week the editors of karmadata choose a Datacard created by one of our members, to recognize their creativity and contributions to the world’s open knowledge base. This week’s Datacard visualizes clinical trial data for a popular biotech company (Gilead) for one of their development programs (Hepatitis C). We copied this card (using its filters) and stamped out ~10 other Datacards, learning about Gilead’s drugs, clinical investigators and more in the Hepatitis C arena. Great work!
Data has meaning, and should inspire ideas and action, the same as words can and often do. Those meanings should be clear and easy to find AND understand, by anyone, anywhere in the world. Society needs to encourage and adopt a data-driven approach to everything, from the impossibly complex (think global warming or healthcare costs), to the commercial (efficiently targeting your customer base), to the much enjoyed social (debating your friends and colleagues on Twitter, LinkedIn or Facebook). Everyone knows “a picture says a thousand words”; well so do facts, and more so facts that have been attractively visualized and easily shared. In a world where we are inundated with marketing messages touting the use of “Big Data” and “Open Data”, sharing and visualizing data seems like a no-brainer. If only that were true.
You see, there is a dirty little secret that the Big Data vendors and Open Data zealots won’t tell you, but we will. Big Data is not open, and Open Data is just as closed. What’s that you say? It seems plausible that certain piles of Big Data are not open to the public, think patient health records, or detailed banking transactions. We call this the Big Data Anonymity Problem, and we think we have a solution for it (more on that later). But how can Open Data not be open? I mean the word open is in its title!
This is where we reclaim the true meaning of the word Open as it relates to Data.
Issue #1: Open is not Free
I am sick and tired of the wrongheaded association of the word Open with Free. Let me set the record straight: nothing, and I mean absolutely nothing, is free. There is the appearance of free. Google seems free, but you are trading information about yourself to advertisers in exchange for the best search engine in the world, and let’s face it, when you need to find a restaurant while traveling, you don’t care about the nuance.
Wikipedia seems free, but most people don’t know that the online wiki has a $28 million annual operating budget and depends on individual and corporate donations (Google contributed $2 million in 2010) to survive.
Downloading data from the U.S. Government seems free, but every time you open your paycheck, take a look at the hefty federal and state taxes that pay for the collection and dissemination of data. The National Library of Medicine has spent $3.2bn over 10 years on publishing Public Data, and that is just one example; there are thousands of others from the U.S. and around the world.
None of these examples is free; they just employ different revenue models (advertising, donations, taxes).
You see, in order to do a technology thing right, it takes resources, and lots of them. Programmers are stubborn like that: they need to get paid (yes, they have mortgages, and college tuitions, and car payments just like you and me). And last time I checked, the folks at Microsoft, Oracle, Dell and HP aren’t starting to give away their software, servers, and storage arrays for the common good! And while the marketing guys keep laying it on thick touting the “Cloud” (doesn’t it sound nice and fluffy?), the actual data centers and hardware that make the “Cloud” go aren’t going to be free any time soon (like ever). Read Richard Stallman’s take on this (see GNU project and Free Software), when he says,
“Think of free as in free speech, not as in free beer”.
I love this quote. Richard and I diverge on the best way to fund technology innovation, but for sure his heart is in the right place.
Issue #2: Open really means Public
I also reject the Open Knowledge Foundation’s interpretation of the word Open, partly because of their improper association of the word Open with Free, but mostly because what they are really talking about is Public Data, and Public Data does not always meet my definition of Open (see below). At some point someone in a meeting decided to replace the word Public with Open, and that was a mistake. Public Data really says it all: it is data that is owned by the Public because it was (wait for it…) PAID for by the citizens. Any government or regulatory data falls into this category, anywhere in the world (there are some economic and ethical issues around one country’s citizens paying for another country’s access to its data, but that is a topic for another blog post). Public Data also includes any Private Data that its owner decides, for one reason or another, to release or publish to the Public (think press releases or public domain websites).
Taking Back the Word Open as it Relates to Data
In my book the word Open, as it pertains to Data, means:
- Accessible and Useful. And no, you Open Data zealots, a zip file of XML formatted records is neither easily accessible nor useful. I mean easily accessible and useful to folks who are not computer programmers. I want end users around the world who are familiar with nothing more than a real web browser (and IE < 8 is not real) to access the highest quality data that exists, for:
  - Free (Basic accounts: advertising model…thanks Google!);
  - Cheap (Premium accounts: as low as $9.99 a month); and
  - Fair prices (Plus accounts: starting at $2,500 a year, scales based on organization size, for those that need anonymity and data to be integrated into their workflow and systems).
  Our freemium (Basic, Premium) accounts require no corporate subscription, as we are going direct to the end user (heads are exploding in the board rooms of Thomson Reuters, McGraw-Hill, Informa, Bloomberg and the like as you are reading this).
- Standardized and Linked. You can almost stop reading here. Until data is standardized with all of the interesting entities (like companies, people, products, countries, cities, etc.), it is really quite useless. Standardized data is intelligent data. Standardized data can be linked to other interesting data sets, allowing you to see the entire picture about a person, place or thing. You can build alerting systems off standardized data. Standardized data can be analyzed and visualized. Standardized data is the s***. Without standardized data, you don’t have data, you have a big pile of goo. And by the way, even you computer programmers out there who can deal with the XML, the parsing, the database normalization and the indexing will quite obviously appreciate and value standardized data so much more.
- Searchable. It sounds obvious, and it sort of is. Until you realize that a critical Public Data set like the FDA’s Adverse Event Reporting doesn’t have a search interface. Wow. We believe that even structured data needs a simple, single text box search. If I want stuff on China, or Pfizer, or Pancreatic Cancer, I just want to type it and go. Yup, we have that.
- Query Ready (it can be analyzed and aggregated). Data needs to be queried, like a dog needs to be walked. The data is just begging for it. Dynamic query engines are gnarly to build, and we have a great one for you to use.
- Visualized. Facts are cool, we love facts. And sometimes all you want is just the facts ma’am. But nothing makes your point for you like an awesome visualization, and we are dedicated to helping you build beautiful visualizations.
- Easily Shared. Last, and most important, data needs to be shared. And in order for that to happen, it has to be easy to share. If it is not easy to share, it isn’t Open. Data needs to be social, and portable, and re-usable. Your friends and colleagues should be able to build on what you started, copying, editing and enhancing to suit their needs. This is the karma behind karmadata.
And Public (sigh “Open”) Data fails miserably on most, if not all, of these points. Some Public Data is better than others, but few are great and none are linked to each other. I started karmadata to help fulfill the promise of Open Data (and Big Data…and Private Data!). Stay tuned, another blog post is coming on that pesky Big Data Anonymity Problem and our ideas on how to open private data up.
One of our key missions is to curate open data sources and provide them back to the world so that the brilliant thought leaders in the industry can use data more effectively and efficiently. This means taking the data in its native format (typically XML, txt, csv, Excel), loading it into our Oracle relational database, standardizing the data to important entities like person, place, organization, drug, and disease, and then providing that data for download in a standard text file format.
Since we want other companies to be downloading and using our data, I figured I should go through the process myself and see how long it would take me to download and load some karmadata. I chose one of my favorite datasets, ClinicalTrials.gov, which is published by the National Institutes of Health. I chose ClinicalTrials.gov because it is published in XML with a fairly complex schema and there are plenty of free text fields that make standardization ultra-difficult and important.
We’ve attempted to make getting off the ground with karmadata as quick and easy as possible. Our Toolkit contains all the metadata that you should need, as well as the SQL scripts to load the data (currently complete for Oracle, with SQL Server, MySQL, etc. to follow in the near future). The hope was that someone could download the data, load it into a relational database, and answer a hard-to-answer question in less than a day.
I began by heading to karmadata.com and cruising the available files on the download page. Knowing that I wanted to load ClinicalTrials.gov, I clicked into the Source Files section to view the raw source data, and then into the Fact Files section to check out the standardized records that accompany it. I downloaded all of the available files.
Next I downloaded the files provided in the Toolkit section. I read the readme.doc to take me through the process. I found it to be extremely well written. Whoever authored it must be incredibly brilliant and good looking. I identified the scripts for creating the tables for fact and source data, as well as the external table scripts to load the data into those tables.
Then I got started. I created the tables for loading, unzipped the first period of data, and ran the inserts to load the data. Rather than programmatically unzipping and loading the data, I simply manually unzipped and ran the inserts as I went.
Ten minutes after I read the readme document, I had the entire ClinicalTrials.gov dataset loaded into a relational database, and best of all it was standardized to entities for sponsor organization, clinical sites, clinical investigators, geography, disease, drug, and time.
Now the fun part. The last thing that we provide in the toolkit is a couple of queries to get you started playing around with the data. In this case we ask the question: which are the leading sites in running industry-sponsored neurodegenerative disease trials from 2009 to 2012? I run the query, and boom, I’m looking at a list that looks like a less attractive version of this data visualization.
Now just to recap what it would take to run that from scratch, you would need to go to ClinicalTrials.gov, download the entire dataset in XML, load the XML into a relational database, standardize the start dates to dates, standardize the many versions of each site name to a standard identifier, then group together all of the MeSH terms that fall under neurodegenerative diseases, and then run a query similar to the one we provided.
These are enormous barriers to entry to a functional, effective way of using the data. But what took us countless hours of development can take you about 10 minutes. (Or you could just find or create a datacard on karmadata.com in about 10 seconds, but you get the point.)
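To give a feel for the shape of that "leading sites" question once the data is standardized, here is a small sketch using Python's built-in sqlite3 module instead of Oracle. Everything about the schema is an assumption for illustration: the table and column names, the sponsor type labels, the disease grouping label, and the toy rows are hypothetical and are not the actual karmadata Toolkit tables or query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema; the real Toolkit tables will differ.
conn.executescript("""
CREATE TABLE trial (trial_id TEXT, sponsor_type TEXT, start_date TEXT);
CREATE TABLE trial_site (trial_id TEXT, site_id INTEGER);
CREATE TABLE trial_disease (trial_id TEXT, disease_group TEXT);
CREATE TABLE site (site_id INTEGER, site_name TEXT);
""")

# Toy rows, made up purely so the query has something to count.
conn.executemany("INSERT INTO trial VALUES (?, ?, ?)", [
    ("T1", "Industry", "2009-03-01"),
    ("T2", "Industry", "2011-07-15"),
    ("T3", "NIH", "2010-01-10"),        # not industry sponsored
    ("T4", "Industry", "2008-05-20"),   # starts before the window
])
conn.executemany("INSERT INTO trial_site VALUES (?, ?)", [
    ("T1", 1), ("T2", 1), ("T2", 2), ("T3", 2), ("T4", 2),
])
conn.executemany("INSERT INTO trial_disease VALUES (?, ?)", [
    (t, "Neurodegenerative Diseases") for t in ("T1", "T2", "T3", "T4")
])
conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "Site A"), (2, "Site B")])

# Standardized start dates, site IDs, and disease groups turn the
# question into a plain join and GROUP BY.
QUERY = """
SELECT s.site_name, COUNT(DISTINCT t.trial_id) AS trials
FROM trial t
JOIN trial_site ts ON ts.trial_id = t.trial_id
JOIN site s ON s.site_id = ts.site_id
JOIN trial_disease d ON d.trial_id = t.trial_id
WHERE t.sponsor_type = 'Industry'
  AND d.disease_group = 'Neurodegenerative Diseases'
  AND t.start_date BETWEEN '2009-01-01' AND '2012-12-31'
GROUP BY s.site_name
ORDER BY trials DESC, s.site_name
"""
leading_sites = conn.execute(QUERY).fetchall()
for site_name, trials in leading_sites:
    print(site_name, trials)
```

The point of the sketch is that all the hard work lives in the columns being filtered and grouped on; without standardized dates, site identifiers, and disease groupings, none of those WHERE and GROUP BY clauses would be possible.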
Using our service was a little surreal for me. I was downloading data that I had downloaded, loaded, and standardized, and then was loading it back into an Oracle database. But it left me wishing that I could just use something like karmadata instead of dealing with all the pains that come with unstandardized data sets. Hopefully it will make you feel the same way.
In the spirit of our blog name, #dataShows, I figured I should use some open data to “show something.” (We want to try to do this every so often, be it here or twitter, and encourage the masses to chime in too. If you’d like to guest blog, shoot us a message. The more the merrier.)
A recent topic in the presidential debates has been foreign oil dependence. Governor Romney has voiced his plans to reduce dependence on foreign oil, and instead rely more heavily on the Americas and/or natural gas. President Obama wouldn’t really disagree with this sentiment either. Now without turning this into a political discussion (I see more than my fair share of political commentary on Facebook), I got to thinking: how has foreign oil dependence been trending over the last several years?
We can answer that in a broad sense with the Energy Information Administration’s database of monthly petroleum imports. This first datacard trends the number of barrels of petroleum imported into the US by region. #dataShows that the Americas have been steadily on the rise, particularly since 2006 when Asia and Africa began to decline.
Examining further, #dataShows that Canada has been steadily on the rise and has seemingly replaced much of the oil that came from Saudi Arabia, Nigeria, and other leading suppliers of imported oil.
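If you want to reproduce this kind of trend yourself, the calculation behind it is roughly "roll monthly imports up to yearly totals, then look at each region's share." Here is a minimal Python sketch of that step. The row layout and the numbers are made up for illustration; they are not actual Energy Information Administration figures or the EIA file format.

```python
from collections import defaultdict

# Made-up monthly rows: (year, month, region, thousand_barrels).
rows = [
    (2006, 1, "Americas", 500), (2006, 1, "Africa", 300), (2006, 1, "Asia", 200),
    (2012, 1, "Americas", 600), (2012, 1, "Africa", 150), (2012, 1, "Asia", 250),
]


def regional_share_by_year(rows):
    """Return {year: {region: share of that year's total imports}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for year, _month, region, barrels in rows:
        totals[year][region] += barrels
    shares = {}
    for year, by_region in totals.items():
        year_total = sum(by_region.values())
        shares[year] = {r: b / year_total for r, b in by_region.items()}
    return shares


shares = regional_share_by_year(rows)
for year in sorted(shares):
    print(year, shares[year])
```

Swapping region for country in the same function gives the country-level view.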