CS 201: Jon Postel Distinguished Lecture: DeepDive and Snorkel: Dark Data Systems to Answer Macroscopic Questions, CHRISTOPHER RÉ, Stanford University

ABSTRACT: Building applications that can read and analyze a wide variety of data may change the way we do science, make business decisions, and develop policy. However, building such applications is challenging: real-world data is expressed in natural language, images, or other “dark” data formats, which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk describes DeepDive, a new type of system designed to cope with Dark Data by combining extraction, integration, and prediction into one system. For some paleobiology and materials science tasks, DeepDive-based systems have surpassed human volunteers in quantity and quality (recall and precision) of extracted information. DeepDive is in daily use by scientists in areas including genomics and drug repurposing, by a number of companies involved in various forms of search, and by law enforcement in the fight against human trafficking. This talk will also describe Snorkel, whose goal is to make routine Dark Data tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll describe our preliminary evidence that the Snorkel approach allows a broader set of users to write dark data programs more efficiently than previous approaches. We will also describe the underlying theory, in particular our recent work on new convergence guarantees for Gibbs sampling and large-scale non-convex optimization, which play a key role in enabling Snorkel to scale. DeepDive and Snorkel are open source on GitHub and available from DeepDive.Stanford.Edu and Snorkel.Stanford.edu.
BIO: Christopher (Chris) Ré is an assistant professor in the Department of Computer Science at Stanford University and a Robert N. Noyce Family Faculty Scholar. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. He then spent four wonderful years on the faculty of the University of Wisconsin, Madison, before moving to Stanford in 2013. He helped discover the first join algorithm with worst-case optimal running time, which won the best paper award at PODS 2012. He also helped develop a framework for feature engineering that won the best paper award at SIGMOD 2014.

Hosted by Professor Carlo Zaniolo

REFRESHMENTS at 3:45 pm, SPEAKER at 4:15 pm

VIDEO TAPED LECTURE:

Date/Time:
Jan 12, 2017
4:15 pm - 5:45 pm

Location:
3400 Boelter Hall
420 Westwood Plaza Los Angeles California 90095