Prof. Yufei Tao Received PODS 2018 Best Paper Award

Congratulations to Prof. Yufei Tao for receiving PODS 2018 Best Paper Award, for the paper titled “Entity Matching with Active Monotone Classification.”  This is a single-authored paper.

The annual ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. The conference includes a fascinating technical program with research and industrial talks, tutorials, demos, and focused workshops. It also hosts a poster session to learn about innovative technology, an industrial exhibition to meet companies and publishers, and a careers-in-industry panel with representatives from leading companies.

Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ∈ X × Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict could not be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the verdicts on the other pairs automatically with good accuracy. At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in Rd, each of which carries a binary label: 0 or 1. A classifier F is a function from Rd to {0,1}. The objective is to find a classifier that captures the labels of a large number of points in P.

In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) ≥ F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p ∈ P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced. We describe algorithms with non trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.