Data Sources for Data Science – Part 1


[Part 2 is here]

With very few exceptions, data scientists need access to large numbers of well-specified data sets in order to learn, practice, and grow. The original intent of this blog was to survey the internet for large, high-quality repositories of such data that could be downloaded and used without restriction, but I abandoned that idea as soon as I did my first Google search! Things certainly have changed in the past few years, and anyone with an internet connection has access to just about every type of data you might wish to study. How did this change, and why?

The popularity of the internet exploded with the advent of the World Wide Web in 1989 – and for those of us who remember those days, one of the major challenges was figuring out what was out there. At first, we shared URLs by email, or one website might list other websites, but there was no single, authoritative map of resources on the web. The first efforts at ‘mapping the web’ came in the form of manually curated catalogs of websites, beginning with Tim Berners-Lee’s World Wide Web Virtual Library (now defunct) – and while none of these early efforts at indexing the web survive in their original form today, some attained considerable success. The hierarchical catalog ‘Jerry and David’s Guide to the World Wide Web’, for example, is better known today as Yahoo. The demise of the manually curated catalog came because curation simply did not scale as the Web expanded into the pervasive global information exchange we know today. The scalable solution we all came to adopt was to split the problem into two pieces: first, to exhaustively explore the web and index its content, and second, to search this index in a way that identifies the most relevant entries to return to the user – a modern search engine.

The first of these problems plays to the strengths of a computer because it is repetitive and follows a clear set of rules, but the second task, interpreting a user request and identifying the best response, is met by an algorithm such as Google’s PageRank. Given the complexity of human language, and the difficulty of understanding the context and intent of the user, this task would always be handled best by a human – if only the scalability problem could be solved. Indeed, PageRank effectively harnesses human judgment at scale: the number of links pointing to a page reflects the collective judgment of the authors of other web pages, and it is this link structure that is used to determine a page’s rank.
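To make that idea concrete, here is a minimal sketch of the PageRank calculation in Python: a simplified power iteration over a tiny hand-made link graph, not Google’s production algorithm. The graph, the damping factor of 0.85, and the iteration count are illustrative assumptions.

    # Simplified PageRank: a page's rank flows from the links that other pages' authors chose to make.
    # The tiny link graph below is purely illustrative.
    links = {
        "home":   ["data", "blog"],
        "data":   ["home"],
        "blog":   ["home", "data"],
        "orphan": ["data"],
    }

    damping = 0.85                             # damping factor from the original PageRank paper
    n = len(links)
    rank = {page: 1.0 / n for page in links}   # start with equal rank everywhere

    for _ in range(50):                        # a few dozen iterations converge for a graph this small
        new_rank = {}
        for page in links:
            # rank is the damped sum of rank flowing in from every page that links here
            incoming = sum(rank[src] / len(out)
                           for src, out in links.items() if page in out)
            new_rank[page] = (1 - damping) / n + damping * incoming
        rank = new_rank

    for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
        print(f"{page:8s} {score:.3f}")

Pages that attract many links from well-linked pages float to the top, which is exactly the “human judgment at scale” effect described above.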

I used one of these modern search engines to see what resources are available – and immediately I found manually curated lists. Excellent examples of such web catalogs for data science include those maintained by the Visual Geometry Group at Oxford University and the UCI Machine Learning Repository, which contains classic data sets such as Fisher’s irises. These resources may lack volume, but they compensate with quality, because they have been assembled and curated by human domain experts.
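As a quick illustration of how accessible these curated collections are, the sketch below pulls Fisher’s iris data straight from the UCI repository with pandas. The URL and column names reflect how the repository served the file at the time of writing, so treat them as assumptions that may change.

    import pandas as pd

    # Fisher's iris data, served as a plain CSV file by the UCI Machine Learning Repository.
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

    iris = pd.read_csv(url, header=None, names=columns)

    print(iris.head())                      # first few rows
    print(iris["species"].value_counts())   # 50 samples of each of the three species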

Catalogs, however, were just the tip of the iceberg. Searches for different data types quickly identified substantial resources without the need to consult a central catalog – examples include repositories of genomic data, climate data, raw physics data from CERN, satellite imagery, and even telemetry from the SpaceX Falcon 9 rocket!

While much of this data was available as files downloadable from web pages, we should note that substantial information, especially in domains that have been generating large datasets for some time, is available using FTP. In addition, various UNIX tools can be used to directly access and download a dataset – examples include Git and Homebrew. Although intended for version control and package management respectively, these tools can manage data just as well as code; alternatively, downloaders such as curl and wget can be used to simplify and automate data access. The providers of some data resources add further constraints – for example, permitting access only through their own API. Examples of this include Wikipedia, Twitter, and the competition website Kaggle, which permits registered users to access competition data in various ways, including through the kaggle-cli Python module. Sensitive data may be restricted further – for example, you can download and analyze telemetry from your electric car using the myCarma application, but it is not clear to what extent this is authorized by the vehicle manufacturers.
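To show what API-mediated access looks like in practice, here is a minimal sketch that asks Wikipedia’s public MediaWiki API for the plain-text introduction of an article, using only the Python standard library. The query parameters follow the MediaWiki documentation as I understand it, and the article title and User-Agent string are just illustrative choices.

    import json
    import urllib.parse
    import urllib.request

    # Wikipedia exposes its content through the MediaWiki API rather than as bulk file downloads.
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",        # ask for page text
        "explaintext": 1,          # plain text rather than HTML
        "exintro": 1,              # introduction section only
        "titles": "Iris flower data set",
    }
    url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)

    # A descriptive User-Agent is polite (and sometimes required) when calling public APIs.
    request = urllib.request.Request(url, headers={"User-Agent": "data-sources-blog-example/0.1"})

    with urllib.request.urlopen(request) as response:
        data = json.load(response)

    # The API keys its results by internal page id, so iterate over whatever pages came back.
    for page in data["query"]["pages"].values():
        print(page["title"])
        print(page["extract"][:300], "...")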

Healthcare data is notoriously hard to obtain because of personal privacy and confidentiality concerns, but this too has been mitigated in recent years. The Synapse system may be used to store data securely, and it has a built-in validation process to ensure medical information is accessible only for authorized purposes; the steps include providing physical proof of identity and obtaining an ORCID digital identifier. An example of a dataset available in this way is the Mobile Parkinson’s Disease Study, mPower, which uses a freely available iPhone application to record patient data and has been used by over 15,000 people.
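For researchers who have completed that validation, access is typically programmatic through the synapseclient Python package. The sketch below shows the general shape of a download; the Synapse ID and credential are placeholders, and the exact conditions attached to the mPower data are assumptions you would need to confirm on the Synapse site.

    import synapseclient

    # Log in with a Synapse account (a personal access token can be generated in account settings).
    syn = synapseclient.Synapse()
    syn.login(authToken="YOUR_PERSONAL_ACCESS_TOKEN")   # placeholder credential

    # Fetch a data file by its Synapse ID; 'syn00000000' is a placeholder, not the real mPower ID.
    entity = syn.get("syn00000000", downloadLocation="./data")
    print("Downloaded to:", entity.path)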

Even streaming data is widely available, and streams or representative samples of streamed data may be obtained through subscription services (often with free, time-delayed counterparts) – for example stock quotes and currency prices, but also real-time feeds from diverse internet-connected sources such as webcams, sensor networks, and the growing range of Internet of Things hardware. For example, real-time Point of Sale data collected via cellphone or iPad are available online to authorized users through services such as Square. These services frequently bundle analytical tools with the data.
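Consuming such a stream usually means holding a connection open and handling messages as they arrive. The sketch below uses the third-party websockets package against a hypothetical quote feed; the endpoint URL, subscription message, and message format are all invented for illustration, so substitute the documented details of whatever service you actually subscribe to.

    import asyncio
    import json

    import websockets   # third-party package: pip install websockets

    # Hypothetical streaming endpoint and subscription message, not a real service.
    FEED_URL = "wss://example-market-data.invalid/stream"
    SUBSCRIBE = {"action": "subscribe", "symbols": ["EURUSD", "GBPUSD"]}

    async def consume_quotes():
        async with websockets.connect(FEED_URL) as ws:
            await ws.send(json.dumps(SUBSCRIBE))
            async for message in ws:        # iterate over messages as they arrive
                quote = json.loads(message)
                print(quote)                # in practice: validate, timestamp, and store

    if __name__ == "__main__":
        asyncio.run(consume_quotes())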