====== CSCI5510 Big Data Analytics ======

==== Breaking News ====

  * **September 3, 2013**. The course homepage has migrated permanently to https://www.cse.cuhk.edu.hk/csci5510/wiki/.
  * **September 2, 2013**. The new semester begins.
  * **September 2, 2013**. Newsgroup address: cuhk.cse.csci5510
  * **September 2, 2013**. The first tutorial will be conducted on Sept. 10. There is no tutorial in the first week.
  * **September 3, 2013**. The tutorial classroom is YIA LT7.

===== 2013-14 Term 1 =====

|  ^ Lecture ^ Tutorial ^
^ Time | M2-4, 9:30 am - 12:30 pm | T3 10:30 am - 11:15 am |
^ Venue | KKB101 | YIA LT7 |

The Golden Rule of CSCI5510: No member of the CSCI5510 community shall take unfair advantage of any other member of the CSCI5510 community.

====== Course Description ======

This course teaches state-of-the-art big data analytics, including techniques, software, applications, and perspectives on massive data. The class will cover, but not be limited to, the following topics: distributed file systems such as Google File System, Hadoop Distributed File System, and CloudStore, together with map-reduce technology; similarity search techniques for big data such as minhash and locality-sensitive hashing; specialized processing and algorithms for data streams; big data search and query technology; big graph analysis; and recommendation systems for Web applications. The applications may include business applications such as online marketing, computational advertising, location-based services, social networks, recommender systems, and healthcare services. Scientific and astrophysics applications, such as environmental sensor applications and nebula search and query, are also covered.

This course aims to teach students state-of-the-art analytics for big data, including techniques, software, applications, and perspectives. The course content will include, but is not limited to: distributed file systems such as the Google File System, the Hadoop file system, and CloudStore, together with Map-reduce technology; similarity search techniques for big data, such as minhash and locality-sensitive hashing; specialized processing methods and algorithms for data streams; search and query technology for big data; and advertising management and recommender systems for Internet applications. The applications covered may include business applications such as online marketing, computational advertising, location-based services, social networks, recommender systems, and healthcare services, as well as applications in science and astrophysics, such as environmental sensor applications and nebula search and query.

===== Learning Objectives =====

  - To understand the current key issues in big data and the associated business/scientific data applications
  - To teach the fundamental techniques and principles in achieving big data analytics with scalability and streaming capability
  - To interpret business models and scientific computing results
  - To apply software tools for big data analytics

===== Learning Outcomes =====

At the end of the course of studies, students will have acquired the ability to

  - Understand the key issues in big data and the associated applications in intelligent business and scientific computing.
  - Acquire fundamental enabling techniques and scalable algorithms in big data analytics.
  - Interpret business models and scientific computing paradigms, and apply software tools for big data analytics.
  - Gain adequate perspective on big data analytics in marketing, financial services, health services, social networking, astrophysics exploration, and environmental sensor applications.

===== Learning Activities =====

  - Lectures
  - Tutorials
  - Web resources
  - Projects
  - Presentations
  - Lab Reports
  - Examinations

====== Personnel ======

|  ^ Lecturer ^ Lecturer ^ Tutor ^ Tutor ^
^ Name | [[https://www.cse.cuhk.edu.hk/irwin.king/home|Irwin King]] | [[http://www.cse.cuhk.edu.hk/~lyu|Michael R. Lyu]] | Guang Ling | Chen Cheng |
^ Email | king AT cse.cuhk.edu.hk | lyu AT cse.cuhk.edu.hk | gling AT cse.cuhk.edu.hk | ccheng AT cse.cuhk.edu.hk |
^ Office | Rm 908 | Rm 927 | Rm 1024 | Rm 1024 |
^ Telephone | 3943 8398 | 3943 8429 | 3943 4252 | 3943 4252 |
^ Office Hour(s) | TBA | 10:00-12:00 Tuesday | TBA | TBA |

Note: This class will be taught in English.
Homework assignments and examinations will be conducted in English.

====== Syllabus ======

The PDF files are created in Acrobat 6.0. Please obtain the correct version of the [[http://www.adobe.com/prodindex/acrobat/readstep.html#reader | Acrobat Reader]] from Adobe.

^ Week ^ Date ^ Topics ^ Tutorials ^ Homework & Events ^ Resources ^
| 1 | 2/9 | Introduction and Motivation \\ \\ {{:teaching:csci5510:01.pptx|}} | No Tutorial | | [[http://infolab.stanford.edu/~ullman/mmds/ch1.pdf|Ch. 1 of MMDS]] |
| 2 | 9/9 | MapReduce \\ \\ [[|02-MapReduce.pdf]] | \\ \\ | \\ \\ | [[http://infolab.stanford.edu/~ullman/mmds/ch2.pdf|Ch. 2 of MMDS]] \\ [[http://infolab.stanford.edu/~ullman/mmds/ch6.pdf|Ch. 6 of MMDS]] |
| 3 | 16/9 | Locality Sensitive Hashing \\ \\ [[|03-lsh.pdf]] | \\ \\ | | [[http://infolab.stanford.edu/~ullman/mmds/ch3.pdf|Ch. 3 of MMDS]] |
| 4 | 23/9 | Mining Data Streams \\ \\ [[|04-stream.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch4.pdf|Ch. 4 of MMDS]] |
| 5 | 30/9 | Scalable Clustering \\ \\ [[|05-clustering.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch7.pdf|Ch. 7 of MMDS]] |
| 6 | 7/10 | Dimensionality Reduction \\ \\ [[|06-DR.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch11.pdf|Ch. 11 of MMDS]] |
| 7 | 14/10 | Public Holiday | | | |
| 8 | 21/10 | Recommender Systems/Matrix Factorization \\ \\ [[|07-mf.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch9.pdf|Ch. 9 of MMDS]] |
| 9 | 28/10 | Massive Link Analysis \\ \\ [[|08-link.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch5.pdf|Ch. 5 of MMDS]] |
| 10 | 4/11 | Mid-term | | | |
| 11 | 11/11 | Analysis of Massive Graphs \\ \\ [[|09-graph.pdf]] | | | [[http://infolab.stanford.edu/~ullman/mmds/ch10.pdf|Ch. 10 of MMDS]] |
| 12 | 18/11 | Large-Scale SVM \\ \\ [[|10-svm.pdf]] | | \\ | [[http://www.svms.org/tutorials/Burges1998.pdf|SVM tutorial]] |
| 13 | 25/11 | Online Learning \\ \\ [[|11-ol.pdf]] | | | [[http://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf|Online learning survey]] |

====== Class Project ======

===== Class Project Presentation Schedule =====

  * TBA

===== Class Project Presentation Requirements =====

====== Examination Matters ======

===== Examination Schedule =====

|  ^ Time ^ Venue ^ Notes ^
^ Midterm Examination | Nov. 4, 9:30 am - 12:00 noon | TBA | TBA |
^ Final Examination | TBA | TBA | TBA |

  * [[http://rgsntl.rgs.cuhk.edu.hk/rws_prd_life/main1.asp|CUHK Registration and Examination]]

===== Written Midterm Matters =====

  - The midterm will test your knowledge of the materials.
  - Answer all questions using the answer booklet. There will be more available at the venue if needed.
  - Write legibly. Anything we cannot decipher will be considered incorrect.
  - One A4-sized cheat-sheet page is allowed.

====== Grade Assessment Scheme ======

^ Homework\\ Assignments ^ Mid-term\\ Examination ^ Project ^
| 20% | 30% | 50% |

  - Assignments (20%)
    - Written assignments
    - Coding
  - Mid-term Examination (30%)
  - Project (50%)
    - Proposal
    - Presentations
    - Report

====== Reference Books ======

====== FAQ ======

  - **Q: What is the departmental guideline on plagiarism?**\\ A: If a student is found plagiarizing, his/her case will be reported to the Department Discipline Committee. If the case is proven after deliberation, the student will automatically fail the course in which he/she committed plagiarism. The definition of plagiarism includes copying the whole or parts of written assignments, programming exercises, reports, quiz papers, and mid-term examinations. The penalty will apply to both the one who copies the work and the one whose work is being copied, unless the latter can prove his/her work has been copied unwittingly.
Furthermore, inclusion of others' works or results without citation in assignments and reports is also regarded as plagiarism with similar penalty to the offender. A student caught plagiarizing during tests or examinations will be reported to the Faculty Office and appropriate disciplinary authorities for further action, in addition to failing the course.

====== Resources ======

  - [[http://pajek.imfm.si/doku.php|Pajek, a network analysis and visualization program.]]
  - [[http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm|Package for Large Network Analysis]]
  - [[http://www.analytictech.com/downloaduc6.htm|UCINET 6]]
  - [[http://www.analytictech.com/Netdraw/netdraw.htm|Netdraw]]
  - [[http://stat.gamma.rug.nl/stocnet/|StOCNET]]

===== Big Data Analytics =====

  * http://infolab.stanford.edu/~ullman/mmds.html
  * http://cs246.stanford.edu/

===== Graph Mining =====

  * http://www.cs.cmu.edu/~deepay/mywww/papers/csur06.pdf
  * http://cs.stanford.edu/people/jure/talks/www08tutorial/
  * http://www.xifengyan.net/tutorial/KDD08_graph_partI.pdf
  * http://www.xifengyan.net/tutorial/KDD08_graph_partII.pdf

===== Link Analysis =====

  * http://analytics.ijs.si/events/Tutorial-TextMiningLinkAnalysis-KDD2007-SanJose-Aug2007/
  * http://www.sigkdd.org/explorations/issues/7-2-2005-12/1-Getoor.pdf
  * http://www.ncjrs.gov/pdffiles1/nij/grants/219552.pdf
  * http://delab.csd.auth.gr/~dimitris/papers/ENVO07LARskm.pdf

===== Learning to Rank =====

  * http://www2009.org/pdf/T7A-LEARNING%20TO%20RANK%20TUTORIAL.pdf
  * http://radlinski.org/papers/LearningToRank_NESCAI08.pdf
  * http://www.aclweb.org/anthology/P/P09/P09-5005.pdf
  * http://www.cse.iitb.ac.in/~soumen/doc/www2007/TutorialSlides.pdf

===== Recommender Systems =====

  * http://en.wikipedia.org/wiki/Recommender_system
  * http://www.deitel.com/ResourceCenters/Web20/RecommenderSystems/RecommenderSystemsTutorialsandWebcasts/tabid/1313/Default.aspx
  * http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.99
  * http://www.springerlink.com/content/n881136032u8k111/
  * http://www.csd.abdn.ac.uk/~jmasthof/Publications/WPRSIUI07.pdf

===== Human Computation/Social Games =====

  * http://www.gwap.com/gwap/
  * http://www.cs.cmu.edu/~biglou/

===== Opinion Mining/Sentiment Analysis =====

  * http://www.cs.uic.edu/~liub/FBS/opinion-mining-sentiment-analysis.pdf
  * http://www.cs.cornell.edu/home/llee/omsa/omsa-published.pdf
  * http://www.cs.cmu.edu/~wcohen/10-802/sentiment-sep-4.ppt

===== Visualization =====

  - [[http://manyeyes.alphaworks.ibm.com/manyeyes/|Many Eyes Visualization]]

===== Programming =====

  - [[http://networkx.lanl.gov/|NetworkX, a Python package for complex networks]]
  - [[http://www.wolfram.com/|Mathematica from Wolfram]]
  - [[http://demonstrations.wolfram.com/|Wolfram Demonstrations]]