Data Central
Official news from HP

Big data deserve a bigger audience – HP Labs on the peer review of data analysis

Scientific analysis of data produced by users of social media sites has world-changing potential. By revealing patterns in user activity, these analyses can improve everything from business to education to the overall human condition.


But because the data is often proprietary, the scientific process of peer review is beginning to break down, potentially leading to troubling results.


In a letter published this week by the premier science journal Nature, Bernardo Huberman, the Director of the HP Labs Social Computing Group, shines a light on the issue, noting “many of the big data results that are coming out are obtained from private sources that are not accessible to researchers beyond the authors of the work.”

 

“Even worse, in some cases the source of the data itself remains hidden, leading not only to problems of verification but also about the generality of the results,” he says.

 

(Bear in mind that Huberman’s reference to “private” should be interpreted as “owned by a certain corporation,” not “an individual’s personal information.” The point is that meaningful, verifiable results can still be gleaned from data that is aggregated and anonymous, rather than individual and personally identifiable.)


Dr. Huberman continues:


“More importantly, we need to recognize that these results will only be meaningful if they are universal, in the sense that many other data sets reveal the same behavior. This actually uncovers a deeper problem. If another set of data does not validate results obtained with private data, how do we know if it is because they are not universal or the authors made a mistake?”


Huberman and the HP Labs Social Computing Research Group have published dozens of studies based on publicly available data from services as diverse as Twitter, YouTube, Digg, Wikipedia, and Gnutella.

 

Their latest study, "The Pulse of News in Social Media: Forecasting Popularity," was recently featured in Technology Review and The Atlantic.


Read the full letter:


The interactive nature of the web has created research opportunities which have been highlighted in this journal and exploited by a number of researchers from the social and information sciences. Patterns that were hard to discern when operating with limited data sets have become apparent as enormous repositories of data collected by large services such as Twitter, Facebook, and Google are accessed by researchers and business professionals.


There is, however, a serious problem with many of these studies. As recently reiterated in this journal by Ravetz (Nature 481, 25 (2012)), science is unique in that peer review, publication and replication are essential to its progress. And yet, many of the big data results that are coming out are obtained from private sources that are not accessible to researchers beyond the authors of the work. Even worse, in some cases the source of the data itself remains hidden, leading not only to problems of verification but also about the generality of the results. While ideally one would like to have the authors share the data, at least these data sources should be accessible to others to verify the findings. This is common practice within the physical and biological communities.


More importantly, we need to recognize that these results will only be meaningful if they are universal, in the sense that many other data sets reveal the same behavior. This actually uncovers a deeper problem. If another set of data does not validate results obtained with private data, how do we know if it is because they are not universal or the authors made a mistake? Moreover, as many practitioners of social network research are starting to discover, many of the results are becoming part of a “cabinet de curiosités” devoid of much generality and hard to falsify.


Besides the potential for fraud, if this trend continues we'll see a small group of scientists with access to private data repositories enjoy an unfair amount of attention in the community at the expense of equally talented researchers whose only flaw is the lack of the right "connections" to private data.

 

Bernardo A. Huberman

Director, Social Computing Group

HP Labs


 

Editor’s note: for another analysis of this emerging topic, read “Why Facebook’s data sharing matters,” by Marshall Kirkpatrick in Read/Write Web.

Labels: HP labs
Data Central is the official HP corporate blog, brought to you by the corporate communications team in Palo Alto. Before commenting, please read our community guidelines. For more news and press contacts, visit the HP newsroom. Note: all times GMT
