Data Sources for Data Science – Part 2


[This is Part 2; to read Part 1, please click here.]

Data scientists need access to diverse, high-quality sources of data in order to practice their trade. In my previous blog I compared finding these resources to the early days of the internet, when the volume of data caused human-curated approaches to give way to the semi-automated, parallel approach of the internet search engine. That transition revealed a truth: we are drowning in data, and the problem has shifted from a scarcity of available resources to comprehending the daunting array of traditional and non-traditional information turned up by a simple search.

My previous blog focused mainly on data available in delimited text formats, but any survey of available data types must also consider relational databases. The relational model may be used to store any type of data, but in science especially, relational systems have been used to store curated data. The data in such a database is obtained, input and edited manually; a scientific example is the Genome Database, GDB. Now defunct, GDB had the majority of its data entered by committees of experts at specially convened conferences. The tradeoffs of such data are complex: manual curation means the dataset is never large, but we do know that each entry has been evaluated by a human curator. Another example is ENSEMBL, whose data are available via a public relational database and include entries manually curated by experts.
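
For the curious, here is a minimal sketch of what access to such a public relational database can look like in practice. It assumes the pymysql driver is installed and uses ENSEMBL's published anonymous connection details; the host, port and user are taken from their documentation and may change between releases:

```python
# A minimal sketch of querying ENSEMBL's public MySQL server.
# Assumes `pip install pymysql`; connection details come from
# ENSEMBL's published documentation and may change over time.
import pymysql

conn = pymysql.connect(
    host="ensembldb.ensembl.org",  # ENSEMBL's public database host
    user="anonymous",              # read-only anonymous account
    port=3306,
)

try:
    with conn.cursor() as cur:
        # List the available databases -- one per species per release.
        cur.execute("SHOW DATABASES LIKE 'homo_sapiens_core%'")
        for (name,) in cur.fetchall():
            print(name)
finally:
    conn.close()
```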

While ENSEMBL was growing, many areas of science saw a revolution in the parallelism of experimentation. A DNA sequencing machine in the mid-1990s might determine the sequence of 30,000 letters of genetic code in a day; by 2010 that number was over one billion. This increasing trend towards automation and parallelism is changing the nature, cost and availability of large raw datasets. One example is the Sloan Digital Sky Survey, which, like so many other projects across the internet, makes its data available for public download.
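
To make "available for public download" concrete, here is a minimal sketch of fetching a few rows from the Sloan Digital Sky Survey via its SkyServer SQL service. The endpoint path and its parameters are assumptions based on a recent data release and may differ in others:

```python
# A minimal sketch of pulling a slice of Sloan Digital Sky Survey data.
# The SkyServer endpoint and its `cmd`/`format` parameters are assumptions
# based on the public SkyServer SQL-search service; check the current
# data-release documentation before relying on them.
import requests

SKYSERVER = "https://skyserver.sdss.org/dr16/SkyServerWS/SearchTools/SqlSearch"
query = "SELECT TOP 10 objID, ra, dec FROM PhotoObj"

resp = requests.get(SKYSERVER, params={"cmd": query, "format": "csv"})
resp.raise_for_status()
print(resp.text)  # CSV rows, ready for pandas.read_csv or similar
```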

So, what has been happening to the world of data? Not so long ago, large, high-quality datasets were hard to find online, and this gave rise to web catalogs directing the enquirer to the few available resources. The widespread availability of data we now enjoy points to a change in the way people think about data. Sharing has long been regarded as best practice in science, at least in theory: the NIH Data Sharing Policy of 2003, for example, mandates that all grant recipients make a plan for sharing their results. Yet implementation has been left up to the data owners, often academics heavily dependent upon publication for their own career advancement. Strong incentives exist for researchers to maintain tight control over their data, and this tension between data sharing and personal advancement has been the subject of much debate. The views of Longo and Drazen are typical: data sharing should ‘happen symbiotically and not parasitically’, implying that the creators of a dataset would maintain control over the use of their data even after placing it into the public domain. In recent years, however, we have seen a change in attitude, as automation and parallelism push the cost of data generation below the point where ownership is worth fighting over.

We also see local and national governments making data available as a civic duty to promote transparency; US data sources of this type are catalogued at data.gov. Alongside this we are seeing the return of the venerable data catalog, and indeed an increasing awareness of the value of human curation, rather than page-ranking algorithms, in steering us to the best-quality information. The new catalogs take advantage of modern version control to spread curation across federated teams of people, with versioning and replication at their heart. One example is the Awesome system of federated lists, which includes lists of publicly accessible datasets.
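
Catalogs like data.gov can be searched programmatically as well. The sketch below assumes the standard CKAN `package_search` action exposed by data.gov's catalog (data.gov is built on CKAN); the query string itself is just an illustrative example:

```python
# A minimal sketch of searching the data.gov catalog programmatically.
# Assumes the standard CKAN `package_search` action API is exposed
# at catalog.data.gov; the query is a hypothetical example.
import requests

resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "air quality", "rows": 5},
)
resp.raise_for_status()

# CKAN wraps results in {"success": ..., "result": {"results": [...]}}
for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])
```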

As is often the case in the world of technology, our choice of solution has come full circle. The availability of diverse, accurate, and cheap data may have defeated individual curation efforts in favor of search engines, but that same ubiquity of data argues for a return to the context and intelligence of human curation, and federated catalogs are emerging to serve this purpose. Will we see a swing back to search engines? Absolutely! As noted, web searching requires the creation of an index, a step ideally suited to computers, followed by the evaluation of index entries in the light of a question, a step still best performed by humans. Surely, in time, we will see data science coming to the rescue, creating algorithms that can evaluate the quality of a dataset as well as an experienced human curator, tipping the balance in favor of search engines once more.
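
To see why the indexing step suits computers so well, consider this toy inverted index built over a few hypothetical dataset descriptions. The mechanical part fits in a dozen lines; judging which hit is the *best* is the part that still needs a human curator:

```python
# A toy illustration of the indexing step described above: building an
# inverted index that maps each term to the documents containing it.
# The documents here are hypothetical stand-ins for dataset descriptions.
from collections import defaultdict

documents = {
    "gdb": "curated human genome map data",
    "ensembl": "genome annotation with curated and automatic entries",
    "sdss": "sky survey imaging and spectral data",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Lookup is trivial; ranking the hits by quality is the hard part.
print(sorted(index["curated"]))  # -> ['ensembl', 'gdb']
print(sorted(index["data"]))     # -> ['gdb', 'sdss']
```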

There are some examples just starting to take shape out there, but I won’t give you links here. Why don’t you find them with a search engine?