Archives For Data Visualization
karmadata enables users to visualize and simply query the world’s healthcare data. So why can’t the Excel download sheets be simple to navigate as well? Now they are! The format of the existing download has greatly improved. We have cleared away unnecessary columns and now provide only the information that is important to you. We have organized the columns and also added a separate tab for clinical trials. Now you can toggle back and forth between a tab for clinical trials and a tab for sponsors.
At the karmadata HQ the developers have been working diligently to create more filters that expand what Sponsor Finder can do. There have been multiple additions already and we want to share them with you. The first addition is the visualization of trial timelines. This gives the user the ability to quickly compare the actual vs. expected timeline for a trial, as well as see the different events that have occurred.
After running a Sponsor Finder search, you will be able to find these visualizations by clicking on the number of trials in the middle of the green circle. This will bring you to the list of clinical trials page. Right away you are able to see each trial’s timeline right under its trial listing.
Some of the events that you will be able to see on the timeline include: original/actual start and end dates, addition of trial sites, when a trial is announced, and enrollment dates. The blue line is the current timeline and the grey line is the original timeline. This gives our clients the ability to easily see trial delays and events instead of having to dig into the source. In the end, this saves you and your team time, which is what Sponsor Finder aims to do! More updates on Sponsor Finder additions to follow.
Yesterday there was quite a bit of buzz about the release of Medicare’s inpatient payments for the top 100 diagnosis related groups. The Washington Post published some highlights on the data including a neat widget to visualize the data. We decided to take our own shot at it. This was a fun dataset for us since we were able to leverage a ton of work that we’ve already done. We had already standardized entities for hospital, organization, DRG, and city from other CMS datasets. I downloaded the data at 3 PM and had it up and running on karmadata by 5. We added a couple of calculated measures for total amount paid by Medicare and discrepancy between amount charged and amount received, and started making datacards.
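For the curious, here is a rough sketch of what those two calculated measures could look like. This is not our production code: the `DrgRow` fields are hypothetical names for the discharge count and the average charge and payment figures, and the formulas are my assumption of a reasonable way to derive the measures from per-hospital, per-DRG averages.

```python
from dataclasses import dataclass


@dataclass
class DrgRow:
    """One hospital/DRG record. Field names are hypothetical, not the source file's."""
    provider: str
    drg: str
    total_discharges: int
    avg_covered_charges: float  # average amount the hospital billed per discharge
    avg_total_payments: float   # average amount the hospital received per discharge


def total_paid(row: DrgRow) -> float:
    """Approximate total paid for this hospital/DRG: discharges times average payment."""
    return row.total_discharges * row.avg_total_payments


def charge_payment_gap(row: DrgRow) -> float:
    """Per-discharge discrepancy between the amount charged and the amount received."""
    return row.avg_covered_charges - row.avg_total_payments


# Made-up numbers, purely for illustration.
example = DrgRow("Example Hospital", "470", 10, 50000.0, 12000.0)
print(total_paid(example), charge_payment_gap(example))
```

Because the inputs are averages, the total is only an approximation, but it is good enough for ranking hospitals and DRGs on a datacard.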
Here’s what folks are saying:
— Brendan Kelleher (@karmadataBK) May 8, 2013
From time to time we’ll highlight a data set on karmadata. Today I’ll provide a quick look at the NIH RePORTER grants database.
The RePORTER database (which replaced the old CRISP database) “provides access to reports, data, and analyses of NIH research activities, including information on NIH expenditures and the results of NIH supported research.” In other words, we get to see our tax dollars at work.
When looking at these data sets I’ll try to highlight what is great about the source data/website (I can’t just be complaining all the time), and then highlight the value that we’re able to add.
The data itself (provided in both CSV and XML) contains the funding agency (NIH, NCI, etc.), the organization receiving the grant, the location, the principal investigators running the study, a list of terms associated with the project, and the amount funded for the project. The RePORTER website has some pretty nice functionality for aggregating and ranking by those different entities. You can play around with that tool here. You can even map the data and drill down to view grants awarded to different states. Neat. The greatest limitation is probably the fact that you can only analyze the data one fiscal year at a time, but overall it’s a pretty nice presentation of the data.
The first thing I look for when I get my hands on a new dataset is the potential entities that we can standardize to. This was a fun dataset for me because of all the entities that can be teased out. In addition to the aforementioned entities, we were able to match the terms list to drugs and diseases. The RePORTER database also provides an ID for the principal investigators, but unfortunately, much like the reviewer ID from BMIS, it is not unique, so we consolidate those entries ourselves. We also consolidate different company names so that they resolve to a unique ID, and then we are ready to go: city, state, country, organization, principal investigator, drug, disease, and time. That makes for a robust database for both building our entity profiles and creating cool visualizations.
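To give a feel for what consolidating company names involves, here is a deliberately simplified sketch. It is an illustration only, not the matching logic we actually run: the suffix list, `normalize_name`, and `assign_entity_ids` are all hypothetical, and real-world matching needs far more care (abbreviations, subsidiaries, misspellings).

```python
import re

# Hypothetical list of tokens to ignore when comparing organization names.
_NOISE_TOKENS = {"inc", "incorporated", "corp", "corporation",
                 "llc", "ltd", "co", "company", "the"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate noise words."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in _NOISE_TOKENS]
    return " ".join(tokens)


def assign_entity_ids(names):
    """Map each raw name to an integer ID shared by all names that normalize alike."""
    ids_by_key = {}
    mapping = {}
    for name in names:
        key = normalize_name(name)
        if key not in ids_by_key:
            ids_by_key[key] = len(ids_by_key) + 1
        mapping[name] = ids_by_key[key]
    return mapping


raw_names = [
    "Johns Hopkins University",
    "JOHNS HOPKINS UNIVERSITY",
    "The Johns Hopkins University",
    "Pfizer Inc.",
    "Pfizer, Inc",
]
for raw, entity_id in assign_entity_ids(raw_names).items():
    print(entity_id, raw)
```

Once every spelling of an organization resolves to one ID, aggregating funding by organization across fiscal years becomes a simple group-by.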
Some facts we have gleaned from the database:
- Johns Hopkins leads the way in NIH funding since FY2000 (with more than $7.5 billion)
- NIH funding increased steadily from 2000 until peaking in 2010 at $38 billion
- Boston leads the way in funding over that time (score one for Boston in the Boston-New York rivalry)
- NIH funding was not limited to the United States: $5.4 billion has been awarded outside the US since 2000, with South Africa leading the way
That should give you a flavor for what you can do with the dataset. Try copying one of my datacards and discovering your own insights.
Data has meaning, and should inspire ideas and action, the same as words can and often do. Those meanings should be clear and easy to find AND understand, by anyone, anywhere in the world. Society needs to encourage and adopt a data-driven approach to everything, from the impossibly complex (think global warming or healthcare costs), to the commercial (efficiently targeting your customer base), to the much enjoyed social (debating your friends and colleagues on Twitter, LinkedIn or Facebook). Everyone knows “a picture says a thousand words”; well, so do facts, and even more so facts that have been attractively visualized and easily shared. In a world where we are inundated with marketing messages touting the use of “Big Data” and “Open Data”, sharing and visualizing data seems like a no-brainer. If only that were true.
You see, there is a dirty little secret that the Big Data vendors and Open Data zealots won’t tell you, but we will. Big Data is not open, and Open Data is just as closed. What’s that you say? It seems plausible that certain piles of Big Data are not open to the public, think patient health records, or detailed banking transactions. We call this the Big Data Anonymity Problem, and we think we have a solution for it (more on that later). But how can Open Data not be open? I mean the word open is in its title!
This is where we reclaim the true meaning of the word Open as it relates to Data.
Issue #1: Open is not Free
I am sick and tired of the wrongheaded association of the word Open with Free. Let me set the record straight: nothing, and I mean absolutely nothing, is free. There is the appearance of free. Google seems free, but you are trading information about yourself to advertisers in exchange for the best search engine in the world, and let’s face it, when you need to find a restaurant while traveling, you don’t care about the nuance.
Wikipedia seems free, but most people don’t know that the online wiki has a $28 million annual operating budget and depends on individual and corporate donations (Google contributed $2 million in 2010) to survive.
Downloading data from the U.S. Government seems free, but every time you open your paycheck, take a look at that hefty federal and state tax that is paying for the collection and dissemination of data. The National Library of Medicine has spent $3.2bn over 10 years on publishing Public Data, and that is just one example from the U.S. government; there are thousands of other examples from the U.S. and around the world.
None of these examples are free, they just employ different revenue models (advertising, donations, taxes).
You see, in order to do a technology thing right, it takes resources, and lots of them. Programmers are stubborn like that: they need to get paid (yes, they have mortgages, and college tuitions, and car payments just like you and me). And last time I checked, the folks at Microsoft, Oracle, Dell and HP aren’t starting to give away their software, servers, and storage arrays for the common good! And while the marketing guys keep laying it on thick touting the “Cloud” (doesn’t it sound nice and fluffy?), the actual data centers and hardware that make the “Cloud” go aren’t going to be free any time soon (like ever). Read Richard Stallman’s (see GNU project and Free Software) take on this when he says,
“Think of free as in free speech, not as in free beer”.
I love this quote. Richard and I diverge on the best way to fund technology innovation, but for sure his heart is in the right place.
Issue #2: Open really means Public
I also reject the Open Knowledge Foundation’s interpretation of the word Open, partly because of their improper association of the word Open with Free, but mostly because what they are really talking about is Public Data, and Public Data does not always meet my definition of Open (see below). At some point someone in a meeting decided to replace the word Public with Open, and that was a mistake. Public Data really says it all: it is data that is owned by the Public because it was (wait for it…) PAID for by the citizens. Any government or regulatory data falls into this category, anywhere in the world (there are some economic and ethical issues around one country’s citizens paying for another country’s access to its data, but that is a topic for another blog post). Public Data also includes any Private Data that its owner decides, for one reason or another, to release or publish to the Public (think press releases or public domain websites).
Taking Back the Word Open as it Relates to Data
In my book the word Open, as it pertains to Data, means:
- Accessible and Useful. And no, you Open Data zealots, a zip file of XML formatted records is neither easily accessible nor useful. I mean easily accessible and useful to folks who are not computer programmers. I want end users around the world, familiar with nothing more than a real web browser (and IE < 8 is not real), to access the highest quality data that exists, for:
  - Free (Basic accounts: advertising model…thanks Google!);
  - Cheap (Premium Accounts: as low as $9.99 a month); and
  - Fair prices (Plus accounts: starting at $2,500 a year, scales based on organization size, for those that need anonymity and data to be integrated into their workflow and systems).

  Our freemium (Basic, Premium) accounts require no corporate subscription, as we are going direct to the end user (heads are exploding in the board rooms of Thomson Reuters, McGraw-Hill, Informa, Bloomberg and the like as you are reading this).
- Standardized and Linked. You can almost stop reading here. Until data is standardized with all of the interesting entities (like companies, people, products, countries, cities, etc.), it is really quite useless. Standardized data is intelligent data. Standardized data can be linked to other interesting data sets, allowing you to see the entire picture about a person, place or thing. You can build alerting systems off standardized data. Standardized data can be analyzed and visualized. Standardized data is the s***. Without standardized data, you don’t have data, you have a big pile of goo. And by the way, even you computer programmers out there who can deal with the XML parsing, the database normalization, and the indexing will quite obviously appreciate and value standardized data so much more.
- Searchable. It sounds obvious, and it sort of is. Until you realize that a critical Public Data set like the FDA’s Adverse Event Reporting doesn’t have a search interface. Wow. We believe that even structured data needs a simple, single text box search. If I want stuff on China, or Pfizer, or Pancreatic Cancer, I just want to type it and go. Yup, we have that.
- Query Ready (it can be analyzed and aggregated). Data needs to be queried, like a dog needs to be walked. The data is just begging for it. Dynamic query engines are gnarly to build, and we have a great one for you to use.
- Visualized. Facts are cool, we love facts. And sometimes all you want is just the facts, ma’am. But nothing makes your point for you like an awesome visualization, and we are dedicated to helping you build beautiful visualizations.
- Easily Shared. Last, and most important, data needs to be shared. And in order for that to happen, it has to be easy to share. If it is not easy to share, it isn’t Open. Data needs to be social, and portable, and re-usable. Your friends and colleagues should be able to build on what you started, copying, editing and enhancing to suit their needs. This is the karma behind karmadata.
And Public (sigh, “Open”) Data fails miserably on most, if not all, of these points. Some Public Data sets are better than others, but few are great and none are linked to each other. I started karmadata to help fulfill the promise of Open Data (and Big Data…and Private Data!). Stay tuned, another blog post is coming on that pesky Big Data Anonymity Problem and our ideas on how to open private data up.