Data has meaning, and should inspire ideas and action, the same as words can and often do. Those meanings should be clear and easy to find AND understand, by anyone, anywhere in the world. Society needs to encourage and adopt a data-driven approach to everything, from the impossibly complex (think global warming or healthcare costs), to the commercial (efficiently targeting your customer base), to the much-enjoyed social (debating your friends and colleagues on Twitter, LinkedIn or Facebook). Everyone knows “a picture says a thousand words”; well, so do facts, and even more so facts that have been attractively visualized and easily shared. In a world where we are inundated with marketing messages touting the use of “Big Data” and “Open Data”, sharing and visualizing data seems like a no-brainer. If only that were true.
You see, there is a dirty little secret that the Big Data vendors and Open Data zealots won’t tell you, but we will. Big Data is not open, and Open Data is just as closed. What’s that you say? It seems plausible that certain piles of Big Data are not open to the public, think patient health records, or detailed banking transactions. We call this the Big Data Anonymity Problem, and we think we have a solution for it (more on that later). But how can Open Data not be open? I mean the word open is in its title!
This is where we reclaim the true meaning of the word Open as it relates to Data.
Issue #1: Open is not Free
I am sick and tired of the wrong-headed association of the word Open with Free. Let me set the record straight: nothing, and I mean absolutely nothing, is free. There is the appearance of free. Google seems free, but you are trading information about yourself to advertisers in exchange for the best search engine in the world, and let’s face it, when you need to find a restaurant while traveling, you don’t care about the nuance.
Downloading data from the U.S. Government seems free, but every time you open your paycheck, take a look at that hefty federal and state tax that is paying for the collection and dissemination of data. The National Library of Medicine has spent $3.2bn over 10 years on publishing Public Data, and that is just one example from the U.S. government; there are thousands of other examples from the U.S. and around the world.
None of these examples are free; they just employ different revenue models (advertising, donations, taxes).
You see, doing a technology thing right takes resources, and lots of them. Programmers are stubborn like that: they need to get paid (yes, they have mortgages, and college tuitions, and car payments just like you and me). And last time I checked, the folks at Microsoft, Oracle, Dell and HP aren’t giving away their software, servers and storage arrays for the common good! And while the marketing guys keep laying it on thick touting the “Cloud” (doesn’t it sound nice and fluffy?), the actual data centers and hardware that make the “Cloud” go aren’t going to be free any time soon (like ever). Read Richard Stallman’s take on this (see the GNU Project and Free Software), when he says,
“Think of free as in free speech, not as in free beer”.
I love this quote. Richard and I diverge on the best way to fund technology innovation, but for sure his heart is in the right place.
Issue #2: Open really means Public
I also reject the Open Knowledge Foundation’s interpretation of the word Open, partly because of their improper association of the word Open with Free, but mostly because what they are really talking about is Public Data, and Public Data does not always meet my definition of Open (see below). At some point someone in a meeting decided to replace the word Public with Open, and that was a mistake. Public Data really says it all: it is data that is owned by the Public because it was (wait for it…) PAID for by the citizens. Any government or regulatory data falls into this category, anywhere in the world (there are some economic and ethical issues around one country’s citizens paying for another country’s access to its data, but that is a topic for another blog post). Public Data also includes any Private Data that its owner decides, for one reason or another, to release or publish to the Public (think press releases or public domain websites).
Taking Back the Word Open as it Relates to Data
In my book the word Open, as it pertains to Data, means:
- Accessible and Useful. And no, you Open Data zealots, a zip file of XML-formatted records is neither easily accessible nor useful. I mean easily accessible and useful to folks who are not computer programmers. I want end users around the world, familiar with nothing more than a real web browser (and IE < 8 is not real), to access the highest quality data that exists, for
  - Free (Basic accounts: advertising model…thanks Google!);
  - Cheap (Premium Accounts: as low as $9.99 a month); and
  - Fair prices (Plus accounts: starting at $2,500 a year, scales based on organization size, for those that need anonymity and data to be integrated into their workflow and systems).
  - Our freemium (Basic, Premium) accounts require no corporate subscription, as we are going direct to the end user (heads are exploding in the board rooms of Thomson Reuters, McGraw-Hill, Informa, Bloomberg and the like as you are reading this).
- Standardized and Linked. You can almost stop reading here. Until data is standardized with all of the interesting entities (like companies, people, products, countries, cities, etc.), it is really quite useless. Standardized data is intelligent data. Standardized data can be linked to other interesting data sets, allowing you to see the entire picture about a person, place or thing. You can build alerting systems off standardized data. Standardized data can be analyzed and visualized. Standardized data is the s***. Without standardized data, you don’t have data, you have a big pile of goo. And by the way, even you computer programmers out there who can deal with the XML, the parsing, the database normalization and the indexing will quite obviously appreciate and value standardized data so much more.
- Searchable. It sounds obvious, and it sort of is. Until you realize that a critical Public Data set like the FDA’s Adverse Event Reporting System doesn’t have a search interface. Wow. We believe that even structured data needs a simple, single-text-box search. If I want stuff on China, or Pfizer, or Pancreatic Cancer, I just want to type it and go. Yup, we have that.
- Query Ready (it can be analyzed and aggregated). Data needs to be queried, like a dog needs to be walked. The data is just begging for it. Dynamic query engines are gnarly to build, and we have a great one for you to use.
- Visualized. Facts are cool; we love facts. And sometimes all you want is just the facts, ma’am. But nothing makes your point for you like an awesome visualization, and we are dedicated to helping you build beautiful visualizations.
- Easily Shared. Last, and most important, data needs to be shared. And in order for that to happen, it has to be easy to share. If it is not easy to share, it isn’t Open. Data needs to be social, and portable, and re-usable. Your friends and colleagues should be able to build on what you started, copying, editing and enhancing to suit their needs. This is the karma behind karmadata.
And Public (sigh “Open”) Data fails miserably on most, if not all, of these points. Some Public Data sets are better than others, but few are great and none are linked to each other. I started karmadata to help fulfill the promise of Open Data (and Big Data…and Private Data!). Stay tuned: another blog post is coming on that pesky Big Data Anonymity Problem and our ideas on how to open private data up.