13.5TB dataset to help advance innovation in computer science
SUNNYVALE, Calif.--(BUSINESS WIRE)--
Yahoo Inc. (NASDAQ: YHOO) today announced the public release of the
largest-ever machine learning
dataset to the academic research community. With this release, the
company aims to advance the field of large-scale machine learning and
recommender systems, and to help level the playing field between
industrial and academic research.
"Many academic researchers and data scientists don't have access to
truly large-scale datasets because it is traditionally a privilege
reserved for large companies," said Suju Rajan, director of research,
Yahoo Labs. "We are releasing this dataset for independent researchers
because we value open and collaborative relationships with our academic
colleagues, and are always looking to advance the state-of-the-art in
machine learning and recommender systems."
The Yahoo News Feed dataset is a collection based on a sample of anonymized
user interactions on the news feeds of several Yahoo properties,
including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance,
Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive
~110B events (13.5TB uncompressed) of user-news item interaction data,
collected by recording the user-item interactions of about 20M users
from February 2015 to May 2015.
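To put those figures in proportion, the short Python sketch below works out the average record size and the average number of events per user. It is only a back-of-the-envelope calculation on the rounded numbers quoted above, and it assumes 1TB means 10^12 bytes; the variable names are illustrative and are not part of the dataset.

```python
# Approximate figures quoted in the release.
TOTAL_BYTES = 13.5e12   # 13.5TB uncompressed (assumption: 1TB = 10**12 bytes)
TOTAL_EVENTS = 110e9    # ~110B user-news item interaction events
TOTAL_USERS = 20e6      # about 20M users

# Average size of one interaction record and average activity per user.
bytes_per_event = TOTAL_BYTES / TOTAL_EVENTS
events_per_user = TOTAL_EVENTS / TOTAL_USERS

print(f"roughly {bytes_per_event:.0f} bytes per event")
print(f"roughly {events_per_user:.0f} events per user")
```

Under these assumptions, each event averages a little over a hundred bytes and each user contributes several thousand events over the collection period.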
"Yahoo's release of the Yahoo News Feed dataset is a significant
contribution to the research community. Academic researchers everywhere
will finally have access to realistic scale data to study how to
automatically discover which news articles are of interest to which
users, and will be able to compare their methods using this as a shared
test case," said Tom Mitchell, machine learning department chair,
Carnegie Mellon University. "Here at CMU we'll certainly be using it for
our research."
The dataset provides categorized demographic information (age range,
gender, and generalized geographic data) for a subset of the anonymized
users. On the item side, the title, summary and key-phrases of the news
article in question are also included, and interaction data is
timestamped with the user's local time and also contains partial
information about the device used to access the news feeds.
"Access to datasets of this size is essential to design and develop
machine learning algorithms and technology that scales to truly 'big'
data," said Gert Lanckriet, professor, Department of Electrical and
Computer Engineering, University of California, San Diego. "At the
Jacobs School of Engineering at UC San Diego, it will directly and
significantly benefit the wide variety of ongoing research in machine
learning, artificial intelligence, information retrieval, and big data
applications."
"At the UMass Amherst Center for Data Science we have broad interests in
developing new methods for scalable analytics on a wide variety of
big-data domains," said Andrew McCallum, director of the Center and
professor in the College of Information and Computer Sciences. "The
release of this large Yahoo News Feed dataset will be a tremendous asset
for the academic research community, and for us at UMass particularly,
given our major research activities in natural language processing,
information retrieval, databases and computational social science."
About the Webscope program:
The dataset is available as part of the Yahoo Labs Webscope
data-sharing program, which is a reference library of
scientifically useful datasets comprised of anonymized user data for
non-commercial use. The dataset we are releasing today is governed by
our commitment to safeguard our users' privacy and follows our practice
of protecting and anonymizing user data.
About Yahoo Labs:
Yahoo Labs is the scientific engine guiding Yahoo innovation while
powering impactful products for Yahoo's users, partners, and
advertisers. Yahoo Labs serves as Yahoo's research arm, its incubator for
bold new ideas and laboratory for rigorous experimentation. Yahoo Labs
applies its scientific findings in powering products for Yahoo's users
and enhancing value for its partners and advertisers. Yahoo Labs'
forward-looking innovation also helps position Yahoo as an industry and
scientific thought leader. For more information, visit labs.yahoo.com
or Yahoo Labs' blog (yahoolabs.tumblr.com).
About Yahoo:
Yahoo is a guide focused on informing, connecting, and entertaining our
users. By creating highly personalized experiences for our users, we
keep people connected to what matters most to them, across devices and
around the world. In turn, we create value for advertisers by connecting
them with the audiences that build their businesses. Yahoo is
headquartered in Sunnyvale, California, and has offices located
throughout the Americas, Asia Pacific (APAC) and the Europe, Middle East
and Africa (EMEA) regions. For more information, visit the pressroom (pressroom.yahoo.net)
or the Company's blog (yahoo.tumblr.com).
View source version on businesswire.com: http://www.businesswire.com/news/home/20160114005200/en/
Yahoo Inc.
Fred Han, 415-713-1562
fredh@yahoo-inc.com
Source: Yahoo Inc.
News Provided by Acquire Media