
Keeping Open Data…Open

Sean Power — March 20, 2013

Data has meaning, and should inspire ideas and action, the same as words can and often do. Those meanings should be clear and easy to find AND understand, by anyone, anywhere in the world. Society needs to encourage and adopt a data-driven approach to everything, from the impossibly complex (think global warming or healthcare costs), to the commercial (efficiently targeting your customer base), to the much-enjoyed social (debating your friends and colleagues on Twitter, LinkedIn or Facebook). Everyone knows "a picture says a thousand words"; well, so do facts, and even more so facts that have been attractively visualized and easily shared. In a world where we are inundated with marketing messages touting the use of "Big Data" and "Open Data", sharing and visualizing data seems like a no-brainer. If only that were true.

You see, there is a dirty little secret that the Big Data vendors and Open Data zealots won't tell you, but we will. Big Data is not open, and Open Data is just as closed. What's that you say? It seems plausible that certain piles of Big Data are not open to the public: think patient health records or detailed banking transactions. We call this the Big Data Anonymity Problem, and we think we have a solution for it (more on that later). But how can Open Data not be open? I mean, the word open is right there in the name!

This is where we reclaim the true meaning of the word Open as it relates to Data.

Issue #1: Open is not Free

I am sick and tired of the wrongheaded association of the word Open with Free. Let me set the record straight: nothing, and I mean absolutely nothing, is free. There is the appearance of free. Google seems free, but you are trading information about yourself to advertisers in exchange for the best search engine in the world, and let's face it, when you need to find a restaurant while traveling, you don't care about the nuance.

Wikipedia seems free, but most people don't know that the online wiki has a $28 million annual operating budget and depends on individual and corporate donations (Google contributed $2 million in 2010) to survive.

Downloading data from the U.S. Government seems free, but every time you open your paycheck, take a look at the hefty federal and state taxes that are paying for the collection and dissemination of data (the National Library of Medicine has spent $3.2bn over 10 years publishing Public Data, and that is just one example from the U.S. government; there are thousands of other examples from the U.S. and around the world).

None of these examples is free; they just employ different revenue models (advertising, donations, taxes).

You see, in order to do a technology thing right, it takes resources, and lots of them. Programmers are stubborn like that: they need to get paid (yes, they have mortgages, and college tuitions, and car payments, just like you and me). And last time I checked, the folks at Microsoft, Oracle, Dell and HP aren't about to start giving away their software, servers, and storage arrays for the common good! And while the marketing guys keep laying it on thick touting the "Cloud" (doesn't it sound nice and fluffy?), the actual data centers and hardware that make the "Cloud" go aren't going to be free any time soon (like, ever). Read Richard Stallman's take on this (see the GNU Project and Free Software), when he says,

“Think of free as in free speech, not as in free beer”.

I love this quote. Richard and I diverge on the best way to fund technology innovation, but for sure his heart is in the right place.

Issue #2: Open really means Public

I also reject the Open Knowledge Foundation's interpretation of the word Open, partly because of their improper association of the word Open with Free, but mostly because what they are really talking about is Public Data, and Public Data does not always meet my definition of Open (see below). At some point someone in a meeting decided to replace the word Public with Open, and that was a mistake. Public Data really says it all: it is data that is owned by the Public because it was (wait for it…) PAID for by the citizens. Any government or regulatory data falls into this category, anywhere in the world (there are some economic and ethical issues around one country's citizens paying for another country's access to its data, but that is a topic for another blog post). Public Data also includes any Private Data whose owner decides, for one reason or another, to release or publish it to the Public (think press releases or public domain websites).

Taking Back the Word Open as it Relates to Data

In my book the word Open, as it pertains to Data, means:

  1. Accessible and Useful.  And no, you Open Data zealots, a zip file of XML-formatted records is not easily accessible, nor is it useful… I mean easily accessible and useful to folks who are not computer programmers.  I want end users around the world, armed with nothing more than a real web browser (and IE < 8 is not real), to access the highest quality data that exists, for
    • Free (Basic accounts: advertising model…thanks Google!);
    • Cheap (Premium Accounts: as low as $9.99 a month); and
    • Fair prices (Plus accounts: starting at $2,500 a year, scales based on organization size, for those that need anonymity and data to be integrated into their workflow and systems).
    • Our freemium (Basic, Premium) accounts require no corporate subscription, as we are going direct to the end user (heads are exploding in the board rooms of Thomson Reuters, McGraw-Hill, Informa, Bloomberg and the like as you are reading this).
  2. Standardized and Linked.  You can almost stop reading here.  Until data is standardized against all of the interesting entities (like companies, people, products, countries, cities, etc.), it is really quite useless.  Standardized data is intelligent data.  Standardized data can be linked to other interesting data sets, allowing you to see the entire picture about a person, place or thing (see the sketch after this list).  You can build alerting systems off standardized data.  Standardized data can be analyzed and visualized.  Standardized data is the s***.  Without standardized data, you don't have data, you have a big pile of goo.  And by the way, even you computer programmers out there who can deal with the XML, the parsing, the database normalization and the indexing will quite obviously appreciate and value standardized data so much more.
  3. Searchable.  It sounds obvious, and it sort of is.  Until you realize that critical Public Data sets like the FDA's Adverse Event Reporting System don't have a search interface.  Wow.  We believe that even structured data needs a simple, single-text-box search.  If I want stuff on China, or Pfizer, or Pancreatic Cancer, I just want to type it and go.  Yup, we have that.
  4. Query Ready (it can be analyzed and aggregated).  Data needs to be queried, like a dog needs to be walked.  The data is just begging for it.  Dynamic query engines are gnarly to build, and we have a great one for you to use.
  5. Visualized.  Facts are cool, we love facts.  And sometimes all you want is just the facts, ma'am.  But nothing makes your point for you like an awesome visualization, and we are dedicated to helping you build beautiful visualizations.
  6. Easily Shared.  Last, and most important, data needs to be shared.  And in order for that to happen, it has to be easy to share.  If it is not easy to share, it isn’t Open.  Data needs to be social, and portable, and re-usable.  Your friends and colleagues should be able to build on what you started, copying, editing and enhancing to suit their needs.  This is the karma behind karmadata.
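To make point 2 concrete, here is a minimal sketch of what "standardized and linked" buys you: two records from different public sources that can only be joined because each free-text organization name has been mapped to a shared identifier. Every field name and identifier below is hypothetical, invented for illustration rather than taken from karmadata's actual schema.

```python
# Hypothetical records from two different public sources. The raw
# organization strings differ, but both have been standardized to the
# same (made-up) entity identifier, which is what makes them linkable.
clinical_trial = {
    "trial_id": "NCT00000000",
    "sponsor_raw": "Pfizer Inc",
    "sponsor_entity_id": "ORG-PFIZER",   # hypothetical standardized id
}
adverse_event = {
    "report_id": "AE-123",
    "manufacturer_raw": "PFIZER, INC.",
    "manufacturer_entity_id": "ORG-PFIZER",
}

# Linking on the standardized id works even though the raw strings differ.
if clinical_trial["sponsor_entity_id"] == adverse_event["manufacturer_entity_id"]:
    print("Same organization across two datasets:",
          clinical_trial["sponsor_entity_id"])
```

Without that shared identifier, joining "Pfizer Inc" to "PFIZER, INC." is exactly the kind of goo-wrangling the list above complains about.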

And Public (sigh “Open”) Data fails miserably on most, if not all, of these points.  Some Public Data is better than others, but few are great and none are linked to each other.  I started karmadata to help fulfill the promise of Open Data (and Big Data…and Private Data!).  Stay tuned, another blog post is coming on that pesky Big Data Anonymity Problem and our ideas on how to open private data up.

One of our key missions is to curate open data sources and provide them back to the world so that the brilliant thought leaders in the industry can use data more effectively and efficiently.  This means taking the data in its native format (typically XML, txt, csv, Excel), loading it into our Oracle relational database, standardizing the data to important entities like person, place, organization, drug, and disease, and then providing that data for download in a standard text file format.
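As a rough illustration of the standardization step in that pipeline, the sketch below parses one heavily simplified ClinicalTrials.gov-style XML record and turns it into a flat, standardized row. The XML layout, the lookup table, and the output field names are all simplified assumptions for illustration, not the real ClinicalTrials.gov schema or karmadata's internal code.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# A heavily simplified stand-in for one ClinicalTrials.gov study record.
RAW_XML = """
<clinical_study>
  <nct_id>NCT00000000</nct_id>
  <sponsor>Pfizer Inc</sponsor>
  <condition>Alzheimer Disease</condition>
  <start_date>March 2009</start_date>
</clinical_study>
"""

# Hypothetical lookup table standing in for real entity standardization.
ORG_IDS = {"pfizer inc": "ORG-PFIZER", "pfizer, inc.": "ORG-PFIZER"}

def standardize(xml_text: str) -> dict:
    """Parse one raw study record and map free text to standardized fields."""
    study = ET.fromstring(xml_text)
    sponsor_raw = study.findtext("sponsor", default="").strip()
    start_raw = study.findtext("start_date", default="").strip()
    return {
        "nct_id": study.findtext("nct_id"),
        "sponsor_entity_id": ORG_IDS.get(sponsor_raw.lower(), "UNKNOWN"),
        "condition": study.findtext("condition"),
        # Normalize a free-text "March 2009" into a real date (first of month).
        "start_date": datetime.strptime(start_raw, "%B %Y").date(),
    }

print(standardize(RAW_XML))
```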

Since we want other companies to be downloading and using our data, I figured I should go through the process myself and see how long it would take me to download and load some karmadata.  I chose one of my favorite datasets, ClinicalTrials.gov, which is published by the National Institutes of Health.  I chose ClinicalTrials.gov because it is published in XML with a fairly complex schema and there are plenty of free text fields that make standardization ultra-difficult and important.

We've attempted to make getting off the ground with karmadata as quick and easy as possible.  Our Toolkit contains all the metadata that you should need, as well as the SQL scripts to load the data (currently complete for Oracle, with SQL Server, MySQL, etc. to follow in the near future).  The hope was that someone could download the data, load it into a relational database, and answer a hard-to-answer question in less than a day.
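To give a feel for that create-and-load step, here is a minimal sketch using SQLite as a stand-in for Oracle and an invented, pipe-delimited fact-file layout; the actual Toolkit scripts, file names, table names, and columns will differ.

```python
import csv
import sqlite3

# SQLite stands in for Oracle here; the table layout is a simplified guess
# at a fact file, not the actual Toolkit schema.
conn = sqlite3.connect("karmadata_demo.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS fact_clinical_trial (
        nct_id            TEXT PRIMARY KEY,
        sponsor_entity_id TEXT,
        condition         TEXT,
        start_date        TEXT
    )
""")

# Load one downloaded, pipe-delimited fact file (the filename is hypothetical).
with open("kd_fact_clinical_trial.txt", newline="") as f:
    rows = [(r["nct_id"], r["sponsor_entity_id"], r["condition"], r["start_date"])
            for r in csv.DictReader(f, delimiter="|")]

conn.executemany(
    "INSERT OR REPLACE INTO fact_clinical_trial VALUES (?, ?, ?, ?)", rows)
conn.commit()
print(f"Loaded {len(rows)} trial records")
```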

I began by heading to karmadata.com and cruising the available files on the download page.  Knowing that I wanted to load ClinicalTrials.gov, I clicked into the Source Files section to view the raw source data, and then into the Fact Files section to check out the standardized records that accompany it.  I downloaded all of the available files.

Next I downloaded the files provided in the Toolkit section. I read the readme.doc to take me through the process.  I found it to be extremely well written.  Whoever authored it must be incredibly brilliant and good looking.  I identified the scripts for creating the tables for fact and source data, as well as the external table scripts to load the data into those tables.

Then I got started.  I created the tables for loading, unzipped the first period of data, and ran the inserts to load the data.  Rather than programmatically unzipping and loading the data, I simply manually unzipped and ran the inserts as I went.

Ten minutes after I read the readme document, I had the entire ClinicalTrials.gov dataset loaded into a relational database, and best of all it was standardized to entities for sponsor organization, clinical sites, clinical investigators, geography, disease, drug, and time.

Now the fun part.  The last thing we provide in the Toolkit is a couple of queries to get you started playing around with the data.  In this case we ask the question: which are the leading sites running industry-sponsored neurodegenerative disease trials from 2009 to 2012?  I run the query, and boom, I'm looking at a list that looks like a less attractive version of this data visualization.


You could download ClinicalTrials.gov from karmadata, or you could just create this data visualization on karmadata
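The starter query itself isn't reproduced here, but a query in that spirit might look roughly like the sketch below. It assumes a hypothetical site-level fact table and column names have also been loaded into the stand-in SQLite database from the earlier example; none of these names come from the real Toolkit.

```python
import sqlite3

conn = sqlite3.connect("karmadata_demo.db")

# Hypothetical version of the starter question: which sites ran the most
# industry-sponsored neurodegenerative disease trials started 2009-2012?
# Table and column names are invented for illustration.
LEADING_SITES_SQL = """
    SELECT site_entity_id,
           COUNT(*) AS trial_count
      FROM fact_trial_site
     WHERE sponsor_type  = 'Industry'
       AND disease_group = 'Neurodegenerative'
       AND start_date BETWEEN '2009-01-01' AND '2012-12-31'
     GROUP BY site_entity_id
     ORDER BY trial_count DESC
     LIMIT 20
"""

for site_id, trial_count in conn.execute(LEADING_SITES_SQL):
    print(site_id, trial_count)
```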

Now, just to recap what it would take to do that from scratch: you would need to go to ClinicalTrials.gov, download the entire dataset in XML, load the XML into a relational database, standardize the start dates into real dates, standardize the many versions of each site name to a single identifier, group together all of the MeSH terms that fall under neurodegenerative diseases, and then run a query similar to the one we provided.
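Two of those from-scratch steps, collapsing site-name variants onto one identifier and rolling MeSH terms up into a disease group, might look something like this toy sketch. The alias list, identifiers, and the handful of MeSH terms shown are illustrative only; the real variant lists and MeSH hierarchy are far larger.

```python
# Toy examples of the normalization work described above; the real mappings
# cover thousands of site-name variants and the full MeSH tree.
SITE_ALIASES = {
    "massachusetts general hospital": "SITE-MGH",
    "mass. general hospital":         "SITE-MGH",
    "mgh, boston, ma":                "SITE-MGH",
}

NEURODEGENERATIVE_MESH = {
    "Alzheimer Disease",
    "Parkinson Disease",
    "Amyotrophic Lateral Sclerosis",
    "Huntington Disease",
}

def site_id(raw_name: str) -> str:
    """Collapse a free-text site name to a standardized identifier."""
    return SITE_ALIASES.get(raw_name.strip().lower(), "UNKNOWN")

def is_neurodegenerative(mesh_terms: list[str]) -> bool:
    """True if any of a trial's MeSH terms falls in the disease group."""
    return any(term in NEURODEGENERATIVE_MESH for term in mesh_terms)

print(site_id("Mass. General Hospital"))
print(is_neurodegenerative(["Alzheimer Disease", "Memory Disorders"]))
```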

These are enormous barriers to entry for a functional, effective way of using the data.  But what took us countless hours of development can take you about 10 minutes.  (Or you could just find or create a datacard on karmadata.com in about 10 seconds, but you get the point.)

Using our service was a little surreal for me.  I was downloading data that I had downloaded, loaded, and standardized, and then was loading it back into an Oracle database.  But it left me wishing that I could just use something like karmadata instead of dealing with all the pains that come with unstandardized data sets.  Hopefully it will make you feel the same way.

Disruptive Innovations to Advance Clinical Trials, Envizualized

Conference illustration by Jonny Goldstein of Envizualize

Sean and I recently attended the 2nd Annual Disruptive Innovations to Advance Clinical Trials for Pharma, Biologics, & Devices (DPharm) in Boston. We had a really great time and learned a ton.

Here are a few themes that really stood out for me:

Increasing momentum for collaboration among pharma companies:

Tom Krohn of Eli Lilly kicked off the conference by bluntly telling us why we need to change the current industry paradigm: because it’s completely unsustainable. Yikes. While radical changes in this industry might feel like trying to pull a 180 in the Titanic, progress is being made.

Elise Felicione of Janssen detailed a cross-pharma investigator databank that is moving forward. Janssen, Merck, and Lilly are collaborating on a project that combines their investigator lists into a database hosted by DrugDev.org (we want to be a part of this effort!). While the project has only reached the exploratory stages at this point, they have gotten past some very difficult hurdles involving lawyers, red tape, and the like. Did you know about all of the redundant tasks that take place between sponsor and investigator, like how investigators have to go through Good Clinical Practice training with each sponsor they work with? As a data geek residing on the periphery of the industry, I did not. Reducing these "redundant burdens" is a no-brainer, but implementing it is no easy task. It will take indomitable leaders to overcome the resistance to sharing competitive intelligence, but luckily it appears that such leaders are in place.
