CSC 2508 Advanced Data Systems

Fall 2022, Mondays 9-11am, ES1042

Class repository: CSC2508 Drive

Class Material Upload repository: CSC2508 Upload Drive

Description

The maturity of several Deep Learning technologies has influenced the design of data processing systems and architectures and prompted a rethinking of several of their design principles. The goal of this course is two-fold. First, we present a review of the fundamental design components of modern data management architectures, including relational and NoSQL systems. Second, we review and explore how these fundamental components can be redesigned by incorporating Deep Learning principles and techniques, and examine the resulting performance and system implications. We will also review and investigate a few novel data management application scenarios that are uniquely enabled by merging Deep Learning and query processing technologies.

Class Logistics

Class Format

  • This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course. A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions. Because of the interactive nature of the course, and space limitations, auditing is discouraged.

Prerequisites

  • A background in algorithms, databases, and machine learning is suggested.

Assignments

  • A research class project: Each of you will work on a research project and file a single submission. The project is broken down into three assignments: (1) initial research proposal, (2) intermediate report, (3) final report and presentation.
  • Proposed class projects will be described by the instructor. Feel free to discuss your ideas with the instructor and propose your own project. However, the project you propose HAS to be associated with the material in the class. This is very important and is not up for discussion. The project should have a research component. Project ideas will be outlined in class, but you are responsible for proposing your project. Some background reading is associated with each project. The project proposal (due Oct 24) should contain the following information:
    • Topic to be addressed and the nature of the problem
    • State of the art (prior work, what remains unsolved, etc.)
    • The proposed technique to be implemented/evaluated
    • To what degree the project will repeat existing work
    • Specific, measurable goals: deliverables, and dates you expect to produce them
    Project proposals should be two pages at most. A project status report is due on Nov 21. The status report should include a description of progress to date and what is expected to be accomplished by the final project presentation day.
  • Questions and Responses (QRs): During each class you will need to provide 2 questions in writing for each (mandatory) paper. Questions are individual assignments. For each class, one of you will be responsible for leading a discussion on the papers and answering all questions provided during class. Questions and responses will then be summarized in a written report, which will be submitted and shared with everyone in the class.
  • There will be no midterm or final exams.

Final Project Reports

  • Final project reports are due Jan 2, 2023.
  • Reports should be structured as a research paper, containing: an introduction to the problem solved and its importance; a summary of related work with citations; a section stating whether the methodology is new and proposed by you or taken from the literature, with citations for any techniques you implemented; a description of the technical developments, outlining the algorithms; a detailed experimental evaluation; and conclusions with thoughts on future work and extensions.
  • Essentially, your report should mirror the structure of the papers you have been reading during the term.

Misc

  • Class time may be adjusted to accommodate external talks related to the class.

Tentative Lecture Plan (Subject to Change)


Date Topic Reading Material
9/12 Logistics and Review of relational technology
  • Background: Review Chapters 3, 4, 10, 11, 13, 14, 15 from Ramakrishnan's book (Database Management Systems), 3rd Edition.
9/19 Overview of NoSQL
  • Davoudian et al., A Survey of NoSQL Stores, ACM Computing Surveys 2018
  • J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
  • Zaharia et al., Apache Spark: A Unified Engine for Big Data Processing, CACM 2016
9/26 Indexing
  • Kraska et al., SageDB: A Learned Database System, CIDR 2018
  • Kraska et al., The Case for Learned Index Structures, SIGMOD 2018
  • Benchmarking Learned Indexes, VLDB 2021
  • M. Mitzenmacher, A Model for Learned Bloom Filters and Related Structures
10/3 Indexing & Searching I
  • Gionis et al., Similarity Search in High Dimensions Using Hashing, VLDB 1999
  • R-Trees: A Dynamic Index Structure for Spatial Searching, A. Guttman, SIGMOD 1984
  • The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries, EDBT 2020
  • Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads, VLDB 2021
  • Learning Multi-dimensional Indexes, SIGMOD 2020
10/17 Indexing and Searching II
  • H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011
  • T. Ge, K. He, Q. Ke and J. Sun, "Optimized Product Quantization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 744-755, April 2014
  • A. Babenko and V. Lempitsky, "The inverted multi-index," 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012
10/24 Indexing and Searching III
  • Y. A. Malkov and D. A. Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824-836, 1 April 2020
  • Fu, Cong and Xiang, Chao and Wang, Changxu and Cai, Deng, Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph, VLDB 2019
  • Wang, Mengzhao and Xu, Xiaoliang and Yue, Qiang and Wang, Yuxiang, A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search, VLDB 2021
10/31 Approximate Queries
  • Mozafari et al., A Handbook for Building an Approximate Query Engine, TKDE 2019
  • Ma et al., Revisiting Approximate Query Engines with Machine Learning Models, SIGMOD 2019
  • Thirumuruganathan et al., Approximate Query Processing Using Deep Generative Models, ICDE 2020
  • Savva et al., ML-AQP: Query-Driven Approximate Query Processing Based on ML, arXiv
11/14 Entity Resolution
  • Mudgal et al., Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018
  • Thirumuruganathan et al., Deep Learning for Blocking in Entity Matching: A Design Space Exploration, VLDB 2021
  • Doan et al., Deep Entity Matching with Pre-Trained Language Models, VLDB 2021
  • Neural Networks for Entity Matching: A Survey, ACM Computing Surveys 2021
11/21 DNN Training
  • Analyzing and Mitigating Data Stalls in DNN Training, VLDB 2021
  • Progressive Compressed Records: Taking a Byte out of Deep Learning Data, VLDB 2021
  • Kuchnik et al., Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines, MLSys 2022
  • Isenko et al., Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines, SIGMOD 2022
11/28 Caching and Prefetching
  • Renen et al., Managing Multi-tier Buffering in Database Systems, SIGMOD 2018
  • Zhou et al., Spitfire: A Three-Tier Buffer Manager for Volatile and Non-Volatile Memory, SIGMOD 2021
  • Vietri et al., Driving Page Replacement with ML-Based LeCaR, HotStorage 2018
  • Liu et al., An Imitation-Based Approach for Cache Replacement, PMLR 2020
  • Lykouris et al., Better Caching with ML-Based Advice, MLSys 2018
12/05 Labelling Training Data
  • Snorkel: Rapid Training Data Creation with Weak Supervision, VLDB 2017
  • Snuba: Automating Weak Supervision to label Training data, VLDB 2019
  • A Survey on Programmatic Weak Supervision
  • End-to-End Weak Supervision
12/08 Query Optimization
  • Overview of Query Optimization in relational Systems, S. Chaudhuri PODS 1998
  • Bao: Making Learned Query Optimization Practical, SIGMOD 2021
  • Vaidya, Kapil and Dutt, Anshuman and Narasayya, Vivek and Chaudhuri, Surajit, Leveraging Query Logs and Machine Learning for Parametric Query Optimization, VLDB 2021
  • Steering Query Optimizers: A Practical Take on Big Data Workloads, SIGMOD 2021

Grading
Weight | Item | Minimal mark | Moderate mark | High mark
30% | Participation | Present | Talkative | Insightful comments or questions
20% | Presentations | Factually correct | Designed and delivered well | Transmits effectively key points, implications, etc.
5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws
45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented
Office Hours

I will hold office hours after class in BA 5240. I am also available by appointment: send me an email (use subject CSC2508).

Late Policy and Deliverables
There are no late days for project deliverables or for in-class questions. Extensions may be granted in the case of a severe medical or family emergency.
Credit
The template of this website was created by HazyResearch@Stanford.