CSC 2508 Advanced Data Systems

Fall 2023, Mondays 9-11am, TBD

Class repository: CSC2508 Drive

Class Material Upload repository: CSC2508 Upload Drive

Description

The maturity of several deep learning technologies has influenced the design of data processing systems and architectures and prompted a rethinking of several of their design principles. The goal of this course is twofold. First, to present a review of the fundamental design components of modern data management architectures, including a review of relational and NoSQL systems. Second, to review and explore how these fundamental components can be redesigned by incorporating deep learning principles and techniques, and to explore the resulting performance and system implications. In particular, this term the course will focus on the redesign of popular information retrieval paradigms using deep learning and large language models.

Class Logistics

Class Format

  • This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course. A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions. Because of the interactive nature of the course, and space limitations, auditing is discouraged.

Prerequisites

  • A background in algorithms, databases, and machine learning is suggested.

Assignments

  • A research class project: Each of you will work on a research project and file a single submission. The project will be broken down into three assignments: (1) an initial research proposal, (2) an intermediate report, and (3) a final report and presentation.
  • Proposed class projects will be described by the instructor. Feel free to discuss your ideas with the instructor and propose your own project. However, the project you propose HAS to be associated with the material in the class. This is very important and is not up for discussion. The project should have a research component. Project ideas will be outlined in class, but you are responsible for proposing your project. Some background reading is associated with each project. The project proposal (due Oct 23) should contain the following information:
    • Topic to be addressed and the nature of the problem
    • State of the art (prior work, what remains unsolved, etc.)
    • The proposed technique to be implemented/evaluated
    • To what degree the project will repeat existing work
    • Specific, measurable goals: deliverables, and dates you expect to produce them
    Project proposals should be two pages at most. A project status report is due on Nov 20. The status report should include a description of progress to date and of what you expect to accomplish by the final project presentation day.
  • Questions and Responses (QRs): For each class you will need to provide 2 questions in writing for each (mandatory) paper. Questions are individual assignments. For each class, one of you will be responsible for leading a discussion of the papers and providing answers to all questions raised during class. Questions and responses will then be summarized in a written report, which will be submitted and shared with everyone in the class.
  • There will be no midterm or final exams.

Final Project Reports

  • Final project reports are due Jan 2, 2024
  • Reports should be structured as a research paper, containing:
    • an introduction to the problem solved and its importance
    • a summary of related work, with citations
    • a section detailing whether the methodology is new and proposed by you or drawn from the literature, with citations of any techniques you implemented
    • a description of the technical developments, outlining the algorithms
    • a detailed experimental evaluation
    • conclusions and thoughts on future work and extensions
  • Essentially, your report should mirror the structure of the papers you have been reading during the term.

Misc

  • Class time may be adjusted to accommodate external talks related to the class.

Tentative Lecture Plan (Subject to Change)


Date Topic Reading Material
9/11 Logistics and Review of relational technology
  • Background: Review Chapters 3, 4, 10, 11, 13, 14, and 15 from Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition.
9/18 Overview of Information Retrieval & NoSQL
  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2009 (Chapters 1-4 - https://nlp.stanford.edu/IR-book/)
  • J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
9/25 Indexing - Background
  • Gionis et al., Similarity Search in High Dimensions via Hashing, VLDB 1999
  • R-Trees: A Dynamic Index Structure for Spatial Searching, A. Guttman, SIGMOD 1984
  • Roussopoulos et al., Nearest Neighbor Queries, SIGMOD 1995
  • Learning Multi-dimensional Indexes SIGMOD 2020
10/2 Indexing and Searching I
  • H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011
  • T. Ge, K. He, Q. Ke and J. Sun, "Optimized Product Quantization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 744-755, April 2014
  • A. Babenko and V. Lempitsky, "The inverted multi-index," 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012
10/16 Indexing and Searching II
  • Y. A. Malkov and D. A. Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824-836, 1 April 2020
  • Fu, Cong and Xiang, Chao and Wang, Changxu and Cai, Deng, Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph, VLDB 2019
  • Wang, Mengzhao and Xu, Xiaoliang and Yue, Qiang and Wang, Yuxiang, A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search, VLDB 2021
10/23 Embeddings and Transformers
  • Chapter 10, Speech and Language Processing book (https://web.stanford.edu/~jurafsky/slp3/)
  • Chapter 6, Speech and Language Processing book
  • The illustrated transformer (https://jalammar.github.io/illustrated-transformer/)
  • Attention Is All You Need, NIPS 2017
  • Attention explained (https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
10/30 Semantic Search I
  • Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et al. EMNLP 2020.
  • ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et al. SIGIR 2020.
  • Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021.
  • REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et al. ICML 2020.
11/13 Semantic Search II
  • Improving Passage Retrieval with Zero-Shot Question Generation. Devendra Singh Sachan et al. EMNLP 2022.
  • Promptagator: Few-shot Dense Retrieval From 8 Examples. Zhuyun Dai et al. ICLR 2023.
  • UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. Jon Saad-Falcon, Omar Khattab et al.
  • InPars: Data Augmentation for Information Retrieval using Large Language Models. Luiz Bonifacio et al. arXiv 2022.
  • Query Expansion by Prompting Large Language Models. Rolf Jagerman et al. arXiv 2023.
11/20 Search and LLMs
  • Generate rather than Retrieve: Large Language Models are Strong Context Generators. Wenhao Yu et al.
  • Zero-Shot Listwise Document Reranking with a Large Language Model. Xueguang Ma et al. arXiv 2023.
  • Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. Weiwei Sun et al. arXiv 2023.
  • Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Izacard et al.
  • Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. Khattab et al.
11/27 Entity and Relationship Extraction
  • Scalable Zero-shot Entity Linking with Dense Entity Retrieval, Wu et al., EMNLP 2020
  • A Frustratingly Easy Approach for Entity and Relation Extraction, Zhong et al.
  • Text-to-Table: A New Way of Information Extraction, Wu et al., ACL 2022
12/04 Text-to-SQL
  • Natural SQL: Making SQL Easier to infer from Natural Language Specifications
  • DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
  • Interactive Text-to-SQL Generation via Editable Step-by-Step Explanations
  • A Comprehensive Evaluation of ChatGPT's Zero-Shot Text-to-SQL Capability
  • Evaluating the Text-to-SQL Capabilities of Large Language Models

Grading
Weight | Item | Minimal mark | Moderate mark | High mark
30% | Participation | Present | Talkative | Insightful comments or questions
20% | Presentations | Factually correct | Designed and delivered well | Effectively transmits key points, implications, etc.
5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws
45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented
Office Hours

I will be holding office hours after class at BA 5240. I am also available by appointment: send me an email (use the subject CSC2508).

Late Policy and Deliverables
There will be no late days for the project deliverables and no late days for in-class questions. Extensions may be granted in the case of a severe medical or family emergency.
Credit
The template of this website was created by HazyResearch@Stanford.