Xiaofeng YU (余晓峰)

Postdoctoral Research Fellow (March 2011 - present)

Department of Computer Science & Engineering
The Chinese University of Hong Kong

Rm 910, Ho Sin-Hang Engineering Building,
Department of Computer Science and Engineering,
CUHK, Shatin, N.T., Hong Kong

Research Interests

  • Text mining, information extraction, and natural language processing
  • Information retrieval, Web search and Web data mining
  • Machine learning and artificial intelligence


Research Projects

RGC Funded Projects

During my Ph.D. training, I mainly worked on several funded research projects, including:

Project Code | Project Title | Amount (HK$'000) | Funder
CUHK413510 | Incorporating Non-local Interactions and Logical Inference into Sequence Classification Model for Practical Text Mining | 1646.168 | RGC
CUHK4128/07 | A Framework for Cooperative Information Extraction and Relation Learning From Texts | 391.776 | RGC

Other Research Projects

I have also contributed to and participated in several well-known international evaluations and shared tasks, including:

TREC 2010

I participated in the entity track of TREC 2010. The goal of this new track is to perform entity-related search on the World Wide Web: given a source entity, return a ranked list of entities of a specified type that engage in a given relationship with it. The track is motivated by the observation that many user information needs are better answered by specific entities than by documents of any type.

  • Project leader and chief system architect.
  • Designed and developed the homepage identification component based on machine learning techniques, and analyzed the effective features it exploits.
  • Designed and investigated the target entity finding component.
  • Designed and developed other major components, such as webpage filtering, source entity identification, and entity homepage and document ID mapping.


SIGHAN-6

In SIGHAN-6, among the 23 groups participating in the official evaluation, our group obtained the best performance on the CityU corpus and fourth place on the MSRA corpus. Moreover, we were the only group that consistently scored over 90 F-measure on all the benchmark corpora in the NER open track.

  • Project leader and chief system architect.
  • Designed and implemented the NER system based on probabilistic graphical models with first-order logic.
  • Investigated the use of first-order logic and computational linguistics (e.g., domain knowledge) to improve the system performance.


SIGHAN-5

I participated in the Chinese named entity recognition (NER) shared task of the third SIGHAN Chinese language processing bakeoff (SIGHAN-5), which provides large-scale benchmark data for evaluation. Our system employed a boosting technique. Even though we did no other Chinese-specific tuning and used only one-third of the MSRA and CityU corpora to train the system, it obtained reasonable results.

  • Project leader and major investigator.
  • Designed and implemented the Chinese NER system.
  • Investigated boosting, a machine learning technique, for the Chinese NER problem, and compared it with other algorithms such as support vector machines and maximum entropy models.


Senseval-3

I participated in the Senseval-3 WSD evaluation, which was organized by ACL-SIGLEX and held in conjunction with ACL 2004. Senseval-3 included 14 different tasks for core word sense disambiguation, as well as identification of semantic roles, multilingual annotations, logic forms, and sub-categorization acquisition.

  • Designed and implemented a toolkit (GUI) for word sense selection.
  • Provided a benchmark test dataset for this evaluation.

NIST-MT 2004

I participated in the 2004 NIST machine translation (MT) evaluation. As part of the DARPA TIDES program, the objective of the NIST MT evaluation series is to support research in, and help advance the state of the art of, machine translation technologies.

  • Pre-processed and word-aligned the NIST-MT 2004 corpus (a bilingual parallel corpus of 1.9 GB).
  • Investigated bilingual semantic lexicon construction, and compared six semantic similarity measures to enhance the lexicon quality.

National 863 and NSFC Projects

My work on these projects included translation optimization in the CEMT2K translation system, word sense disambiguation (WSD) based on bilingual information, and automatic building of bilingual semantic lexicons for translation selection.

  • Optimized translation rules to improve system performance.
  • Investigated the automatic acquisition of translation knowledge and translation rules.

Workshop Program Committee Member

  • The 5th SIGHAN Workshop on Chinese Language Processing (SIGHAN-5, 2006)
  • The 6th SIGHAN Workshop on Chinese Language Processing (SIGHAN-6, 2008)

Journal Reviewer

  • ACM Transactions on Knowledge Discovery from Data (TKDD)
  • ACM Transactions on Information Systems (TOIS)
  • IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • International Journal of Information Processing and Management (IJIPM)
  • IEEE Transactions on Knowledge and Data Engineering (TKDE)
  • Journal of Information Retrieval (IR)

Conference Reviewer

  • ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2008, 2009, 2010)
  • ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007, 2008, 2009)
  • The International World Wide Web Conference (WWW 2008, 2009)
  • ACM Conference on Web Search and Data Mining (WSDM 2009)
  • The International Conference on Empirical Methods in Natural Language Processing (EMNLP 2008)
  • ACM International Conference on Information and Knowledge Management (CIKM 2007, 2008, 2009, 2010)
  • IEEE International Conference on Data Mining (ICDM 2007, 2009)
  • SIAM Conference on Data Mining (SDM 2008)
  • The Asia Information Retrieval Symposium (AIRS 2008, 2009)
  • The International Conference on the Computer Processing of Oriental Languages (ICCPOL 2007, 2008)
  • The International Conference on Machine Learning and Cybernetics (ICMLC 2008)

Work Experience

  • Sep 2005 - Jan 2007, Research Assistant, Dept. of Computer Science & Engineering, The Hong Kong University of Science & Technology
  • July 2007 - Oct 2010, Teaching Assistant, Dept. of Systems Engineering & Engineering Management, The Chinese University of Hong Kong

Teaching Assistantships

  • Spring 2010: CSC 2100E/F (Data Structures)
  • Fall 2009: SEG 3460 (Computer Processing System Concepts)
  • Spring 2009: SEG 3550 (Fundamentals in Information Systems)
  • Fall 2008: SEG 3460 (Computer Processing System Concepts)
  • Spring 2008: SEG 3460 (Computer Processing System Concepts)
  • Fall 2007: SEG 3460 (Computer Processing System Concepts)



  • Familiar with object-oriented development in C++, VC++, and Java; familiar with C and Perl
  • Development and analysis on UNIX/Linux, Windows, and Solaris systems
  • Knowledge of relational databases and SQL; knowledge of Web languages such as HTML, JavaScript, and XML


  • Motivated, passionate about technology
  • Proactive, self-starter, autonomous
  • Good team spirit, good verbal and written communication skills
  • Able to work in an environment with ambiguous and changing requirements

Personal Hobbies

Hiking, running, and ping-pong

