Data has meaning, and it should inspire ideas and action, just as words can and often do. Those meanings should be clear and easy to find AND understand, by anyone, anywhere in the world. Society needs to encourage and adopt a data-driven approach to everything, from the impossibly complex (think global warming or healthcare costs), to the commercial (efficiently targeting your customer base), to the much-enjoyed social (debating your friends and colleagues on Twitter, LinkedIn or Facebook). Everyone knows “a picture says a thousand words”; well, so do facts, and even more so facts that have been attractively visualized and easily shared. In a world where we are inundated with marketing messages touting “Big Data” and “Open Data”, sharing and visualizing data seems like a no-brainer. If only that were true.
You see, there is a dirty little secret that the Big Data vendors and Open Data zealots won’t tell you, but we will. Big Data is not open, and Open Data is just as closed. What’s that you say? It seems plausible that certain piles of Big Data are not open to the public, think patient health records, or detailed banking transactions. We call this the Big Data Anonymity Problem, and we think we have a solution for it (more on that later). But how can Open Data not be open? I mean the word open is in its title!
This is where we reclaim the true meaning of the word Open as it relates to Data.
Issue #1: Open is not Free
I am sick and tired of the wrongheaded association of the word Open with Free. Let me set the record straight: nothing, and I mean absolutely nothing, is free. There is only the appearance of free. Google seems free, but you are trading information about yourself to advertisers in exchange for the best search engine in the world, and let’s face it, when you need to find a restaurant while traveling, you don’t care about the nuance.
Wikipedia seems free, but most people don’t know that the online wiki has a $28 million annual operating budget and depends on individual and corporate donations (Google contributed $2 million in 2010) to survive.
Downloading data from the U.S. Government seems free, but every time you open your paycheck, take a look at the hefty federal and state taxes that pay for the collection and dissemination of that data (the National Library of Medicine alone has spent $3.2bn over 10 years publishing Public Data, and that is just one example from the U.S. government; there are thousands of other examples from the U.S. and around the world).
None of these examples is free; they just employ different revenue models (advertising, donations, taxes).
You see, in order to do a technology thing right, it takes resources, and lots of them. Programmers are stubborn like that: they need to get paid (yes, they have mortgages, and college tuitions, and car payments, just like you and me). And last time I checked, the folks at Microsoft, Oracle, Dell and HP aren’t about to give away their software, servers and storage arrays for the common good! And while the marketing guys keep laying it on thick touting the “Cloud” (doesn’t it sound nice and fluffy?), the actual data centers and hardware that make the “Cloud” go aren’t going to be free any time soon (like ever). Consider Richard Stallman’s take on this (see the GNU project and Free Software), when he says,
“Think of free as in free speech, not as in free beer”.
I love this quote. Richard and I diverge on the best way to fund technology innovation, but for sure his heart is in the right place.
Issue #2: Open really means Public
I also reject the Open Knowledge Foundation’s interpretation of the word Open, partly because of their improper association of the word Open with Free, but mostly because what they are really talking about is Public Data – and Public Data does not always meet my definition of Open (see below). At some point, someone in a meeting decided to replace the word Public with Open, and that was a mistake. Public Data really says it all: it is data that is owned by the Public because it was (wait for it…) PAID for by its citizens. Any government or regulatory data falls into this category, anywhere in the world (there are some economic and ethical issues around one country’s citizens paying for another country’s access to its data, but that is a topic for another blog post). Public Data also includes any Private Data whose owner decides, for one reason or another, to release or publish it to the Public (think press releases or public-domain websites).
Taking Back the Word Open as it Relates to Data
In my book the word Open, as it pertains to Data, means:
- Accessible and Useful. And no, you Open Data zealots, a zip file of XML-formatted records is not easily accessible, nor is it useful… I mean easily accessible and useful to folks who are not computer programmers. I want end users around the world, armed with nothing more than a real web browser (and IE < 8 is not real), to access the highest quality data that exists, for
- Free (Basic accounts: advertising model…thanks Google!);
- Cheap (Premium Accounts: as low as $9.99 a month); and
- Fair prices (Plus accounts: starting at $2,500 a year, scales based on organization size, for those that need anonymity and data to be integrated into their workflow and systems).
- Our freemium (Basic, Premium) accounts require no corporate subscription, as we are going direct to the end user (heads are exploding in the boardrooms of Thomson Reuters, McGraw-Hill, Informa, Bloomberg and the like as you read this).
- Standardized and Linked. You can almost stop reading here. Until data is standardized around all of the interesting entities (like companies, people, products, countries, cities, etc.), it is really quite useless. Standardized data is intelligent data. Standardized data can be linked to other interesting data sets, allowing you to see the entire picture about a person, place or thing. You can build alerting systems off standardized data. Standardized data can be analyzed and visualized. Standardized data is the s***. Without standardized data, you don’t have data, you have a big pile of goo. And by the way, even you computer programmers out there who can deal with the XML parsing, database normalization and indexing will quite obviously appreciate and value standardized data so much more.
- Searchable. It sounds obvious, and it sort of is. Until you realize that critical Public Data sets like the FDA’s Adverse Event Reporting System don’t have a search interface. Wow. We believe that even structured data needs a simple, single-text-box search. If I want stuff on China, or Pfizer, or Pancreatic Cancer, I just want to type it and go. Yup, we have that.
- Query Ready (it can be analyzed and aggregated). Data needs to be queried, like a dog needs to be walked. The data is just begging for it. Dynamic query engines are gnarly to build, and we have a great one for you to use.
- Visualized. Facts are cool, we love facts. And sometimes all you want is just the facts, ma’am. But nothing makes your point for you like an awesome visualization, and we are dedicated to helping you build beautiful visualizations.
- Easily Shared. Last, and most important, data needs to be shared. And in order for that to happen, it has to be easy to share. If it is not easy to share, it isn’t Open. Data needs to be social, and portable, and re-usable. Your friends and colleagues should be able to build on what you started, copying, editing and enhancing to suit their needs. This is the karma behind karmadata.
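To make the standardized/linked/query-ready distinction concrete, here is a minimal sketch. It is purely illustrative: the datasets, field names, and the `standardize` helper are all invented for this example and are not karmadata’s actual schema or matching logic. The idea is simply that once messy entity names collapse to one canonical key, two raw extracts become linkable and queryable:

```python
# Hypothetical sketch: why standardized, linked data beats raw extracts.
# All dataset names and fields below are invented for illustration.

# Two "Public Data" extracts that mention the same company differently.
drug_approvals = [
    {"company": "Pfizer Inc.", "drug": "DrugA", "year": 2011},
    {"company": "PFIZER, INC", "drug": "DrugB", "year": 2012},
]
adverse_events = [
    {"company": "Pfizer, Inc.", "drug": "DrugA", "reports": 42},
]

def standardize(name: str) -> str:
    """Map messy company strings to one canonical entity key (toy rules)."""
    key = name.lower().rstrip(".").replace(",", "").replace(" inc", "")
    return key.strip()

# Standardization: every record now carries the same entity key.
for row in drug_approvals + adverse_events:
    row["company_id"] = standardize(row["company"])

# Linking: join the two extracts on the shared entity key and drug.
linked = [
    {**a, "reports": e["reports"]}
    for a in drug_approvals
    for e in adverse_events
    if a["company_id"] == e["company_id"] and a["drug"] == e["drug"]
]

# Query-ready: once linked, an aggregation is a one-liner.
total_reports = sum(r["reports"] for r in linked)
print(total_reports)  # prints 42
```

Without the `standardize` step, the join finds nothing, because “Pfizer Inc.” and “PFIZER, INC” never match as raw strings; that is the “big pile of goo” problem in miniature.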
And Public (sigh “Open”) Data fails miserably on most, if not all, of these points. Some Public Data is better than others, but few are great and none are linked to each other. I started karmadata to help fulfill the promise of Open Data (and Big Data…and Private Data!). Stay tuned, another blog post is coming on that pesky Big Data Anonymity Problem and our ideas on how to open private data up.
Thanks for a very insightful and engaging article. Your perspective on the difference between open and free datasets was an important one. However, some recent learning in this area leads me to believe that datasets should have a clear end use-case for them to be economically viable for the seller and the buyer of such datasets.
I attended a conference recently, which had a workshop on data marketplaces. For the benefit of those who did not attend the conference and this particular workshop, I’d like to summarize some of the key messages that were discussed:
1) Having a verticalized use case for a dataset (whether open or closed) is extremely important. A gentleman at the conference mentioned ancestry.com as a data market of sorts. For example, the site describes the service as something that examines DNA to identify users’ ethnic make-up and show them where they came from. Each television advertisement for ancestry is a use case in itself.
2) Some other examples discussed of how data gets used were LifeLock (identity theft protection), Weather Underground (weather information that uses different types of data, including sensor data) and of course Google Maps, discussed in the specific context of crowdsourcing for adding and correcting data.
3) Some of the key issues with data sets are a) keeping the data updated and b) providing real-time updates, both of which are vital.
4) Owners of data have traditionally controlled its use, and incumbents control this market. There is a cost attached to aggregating datasets, and therefore they cannot be free. Moreover, large incumbents control a lot of the data that is out there (Bloomberg, for example). From a start-up perspective, I’d like to ask this forum for ideas on how a start-up can make headway in an environment where it’s difficult to recover the costs of datasets without clear end-use cases and where large players are already dominant.
I look forward to learning from the viewpoints of other folks in this discussion. Thanks!
Thanks so much for contributing your thoughts. I agree that a clear end use-case is a necessary component of any service, not just data. Some thoughts to further the discussion:
1. We believe it is best for the user to determine his or her end use case. In order to make your data business viable, you should have expertise in a certain area that drives your data acquisition and UI strategy, and where there is a critical mass of users to support your data idea.
2. When you get right down to it, all Public data is interesting to somebody (otherwise, why bother regulating the industry and requiring the data to be reported to the government!). The trick is whether you can identify the use case…and this is where your industry expertise comes in as well as significant diligence with your potential user base.
3. People pay for needs, wants, AND convenience. People care more about quality than cost. People pay the most for unmet needs, less for wants and convenience.
4. The cost of doing things does continue to drop, but there remains a significant financial hurdle for new data businesses to get off the ground (don’t believe the Cloud hype!). Servers, software, programmers are all quite expensive (even if you go the open source route for software). The toughest cost to control is the human cost – quality programmers, data scientists, analysts, are all quite expensive. If you cheap out here, your business will fail.
What we are trying to do is lower the hurdle for new data businesses by doing a significant amount of the data processing and standardization work, so that viable app builders can come to us and, for a fraction of the cost of doing it themselves, obtain high-quality standardized data (aka karmadata) that suits their business idea from #1. The app builder can then focus their attention on the end user, perhaps some proprietary modeling off the karmadata, and their UI. Almost all of the interesting data businesses out there are 80% or more built on Public data that has been standardized, mined, and melded to fit a specific end-user case. So if we can do that work for the world, we have done the world a service by making data businesses more efficient.
Good luck with your data business, keep listening to your users!