Title: Facilitating Programming for Data Science via DSLs and Machine Learning
Date: September 19, 2019 (Thursday)
Time: 2:30 pm - 3:30 pm
Venue: Room 121, 1/F, Ho Sin-Hang Engineering Building, The Chinese University of Hong Kong, Shatin, N.T.
Speaker: Prof. Artur Andrzejak
University of Heidelberg, Germany


Data processing and analysis are becoming relevant for a growing number of domains and applications, ranging from natural science to industrial applications. Given the variety of scenarios and the need for flexibility, each project typically requires custom programming. This task can pose a challenge for domain specialists (typically non-developers), and it frequently becomes a major cost and time factor in crafting a solution. The problem is aggravated further if performance or scalability is important, due to the increased complexity of developing parallel/distributed software.

This talk focuses on selected solutions to these challenges. In particular, we will discuss NLDSL [1], a tool for accelerated implementation of Domain Specific Languages (DSLs) for libraries following the "fluent interface" programming model. We showcase how this solution facilitates script development in the context of popular data science frameworks and libraries such as (Python) Pandas, scikit-learn, Apache Spark, and Matplotlib. The key elements are "no overhead" integration of DSL and Python code, DSL-level code recommendations, and support for adding ad-hoc DSL elements tailored to even small application domains.

We will also discuss solutions utilizing machine learning. One of them is a code fragment recommender. Here, frequently used code fragments (snippets) are extracted from Stack Overflow/GitHub, generified, and stored in a database. During development, they are recommended to users based on textual queries, selection of relevant data, user interaction history, and other inputs.

Another work attempts to combine the approach to Python code completion via neural attention and pointer networks by Jian Li et al. [2] with probabilistic models for code [3]. Our study shows a promising improvement in accuracy.

If time permits, we will also take a quick look at alternative approaches for accelerated programming in the context of data analysis: natural language interfaces for code development (e.g., bots) and emerging technologies for program synthesis.


[1] Artur Andrzejak, Kevin Kiefer, Diego Costa, Oliver Wenz, Agile Construction of Data Science DSLs (Tool Demo), ACM SIGPLAN Int. Conf. on Generative Programming: Concepts & Experiences (GPCE), 21-22 October 2019, Athens, Greece.

[2] Jian Li, Yue Wang, Michael R. Lyu, and Irwin King, Code completion with neural attention and pointer networks. In Proc. 27th International Joint Conference on Artificial Intelligence (IJCAI'18), 2018, AAAI Press.

[3] Pavol Bielik, Veselin Raychev, and Martin Vechev, PHOG: Probabilistic model for code. In Proc. 33rd International Conference on Machine Learning, 20–22 June 2016, New York, USA.

Speaker’s Bio: 

Artur Andrzejak received a PhD degree in computer science from ETH Zurich in 2000 and a habilitation degree from FU Berlin in 2009. He was a postdoctoral researcher at HP Labs Palo Alto from 2001 to 2002 and a researcher at ZIB Berlin from 2003 to 2010. He led the CoreGRID Institute on System Architecture (2004 to 2006) and acted as Deputy Head of the Data Mining Department at I2R Singapore in 2010. Since 2010 he has been a W3 professor at the University of Heidelberg, where he leads the Parallel and Distributed Systems group. His research interests include scalable data analysis, reliability of complex software systems, and cloud computing. To find out more about his research group, visit http://pvs.ifi.uni-heidelberg.de/.

Enquiries: Ms. Shirley Lau at tel. 3943 8439

For more information, please refer to http://www.cse.cuhk.edu.hk/en/events