CSC 2508 Advanced Data Systems

Fall 2021, Mondays 9-11am, Online

Class repository: CSC2508 Drive

Class Material Upload repository: CSC2508 Upload Drive


The maturity of several Deep Learning technologies has influenced the design of data processing systems and architectures and prompted a re-thinking of several of their design principles. The goal of this course is two-fold. First, we present a review of the fundamental design components of modern data management architectures, including relational and NoSQL systems. Second, we review and explore how these fundamental components can be re-designed by incorporating Deep Learning principles and techniques, and examine the resulting performance and system implications. We will also review and investigate a few novel data management application scenarios that are uniquely enabled by merging Deep Learning and query processing technologies.

Class Logistics

Class Format

  • This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course. A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions. Because of the interactive nature of the course, and space limitations, auditing is discouraged.


  • A background in algorithms, databases, and machine learning is suggested.


  • A research class project: Each of you will work on a research project and file a single submission. The project will be broken down into three assignments: (1) initial research proposal, (2) intermediate report, (3) final report and presentation.
  • Proposed class projects will be described by the instructor. Feel free to discuss your ideas with the instructor and propose your own project. However, the project you propose HAS to be associated with the material in the class. This is very important and it is not up for discussion. The project should have a research component. Project ideas will be outlined in class, but you are responsible for proposing your project. Some background reading is associated with each project. The project proposal (due date Oct 25) should contain the following information:
    • Topic to be addressed and the nature of the problem
    • State of the art (prior work, what remains unsolved, etc.)
    • The proposed technique to be implemented/evaluated
    • To what degree the project will repeat existing work
    • Specific, measurable goals: deliverables, and dates you expect to produce them
    Project proposals should be two pages at most. A project status report is due on Nov 22. The status report should include a description of progress to date and what is expected to be accomplished by the final project presentation day.
  • Questions and Responses (QRs): During each class you will need to provide 3 questions in writing for each (mandatory) paper. Questions will be individual assignments. For each class, one of you will be responsible for leading a discussion on the papers and providing answers to all questions submitted during class. Questions and responses will then be summarized in a written report, which will be submitted and shared with everyone in the class.
  • There will be no midterm or final exams.

Final Project Reports

  • Final project reports are due Jan 3, 2022.
  • Reports should be structured as a research paper, containing an introduction to the problem solved and its importance, a summary of related work with citations, a section detailing whether the methodology is new and proposed by you or taken from the literature (with citations of any techniques you implemented), a description of the technical developments outlining the algorithms, a detailed experimental evaluation, and conclusions with thoughts on future work and extensions.
  • Essentially, your report should mirror the structure of the papers you have been reading during the term.


  • Class time may be adjusted to accommodate external talks related to the class.

Tentative Lecture Plan (Subject to Change)

# Date Topic Lecture Materials Reading Material
1 9/13 Logistics and Review of relational technology Lecture 1
  • Background: Review Chapters 3, 4, 10, 11, 13, 14, 15 from Ramakrishnan's book (Database Management Systems), 3rd Edition.
  • R-Trees: A Dynamic Index Structure for Spatial Searching, A. Guttman, SIGMOD 1984
2 9/20 Overview of NoSQL Lecture 2
  • Davoudian et al., A Survey on NoSQL Stores, ACM Computing Surveys 2018
  • J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
  • Zaharia et al., Apache Spark: A Unified Engine for Big Data Processing, CACM 2016
3 9/27 Indexing Lecture 3
  • Kraska et al., SageDB: A Learned Database System, CIDR 2019
  • Kraska et al., The Case for Learned Index Structures, SIGMOD 2018
  • Benchmarking Learned Indexes, VLDB 2021
  • Ding et al., ALEX: An Updatable Adaptive Learned Index, SIGMOD 2020
  • M. Mitzenmacher, A Model for Learned Bloom Filters and Related Structures
4 10/4 Multi-d Indexing Lecture 4
  • Learned Index for Spatial Queries, MDM 2019
  • The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries, EDBT 2020
  • Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads, VLDB 2021
  • Learning Multi-dimensional Indexes, SIGMOD 2020
5 10/18 Sorting & Nearest Neighbors Lecture 5
  • Kristo et al., The Case for a Learned Sorting Algorithm, SIGMOD 2020
  • Dong et al., Learning Space Partitions for NN Search, ICLR 2020
  • Gionis et al., Similarity Search in High Dimensions Using Hashing, VLDB 1999
  • Billion-Scale Similarity Search with GPUs
6 10/25 Approximate Queries Lecture 6
  • Mozafari et al., A Handbook for Building an Approximate Query Engine, TKDE 2019
  • Ma et al., Revisiting Approximate Query Engines with Machine Learning Models, SIGMOD 2019
  • Thirumuruganathan et al., Approximate Query Processing Using Deep Generative Models, ICDE 2020
  • Savva et al., ML-AQP: Query Driven Approximate Query Processing based on ML, arXiv
7 11/1 Entity Resolution Lecture 7
  • Ebraheem et al., Distributed Representations of Tuples for Entity Resolution, PVLDB 2018
  • Brunner et al., Entity Matching with Transformer Architecture, EDBT 2020 (optional)
  • Doan et al., Deep Entity Matching with Pre-trained Language Models, VLDB 2021
  • Neural Networks for Entity Matching: A Survey, ACM Computing Surveys 2021
  • Mudgal et al., Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018
8 11/8 DNN Training Lecture 8
  • Analyzing and Mitigating Data Stalls in DNN Training, VLDB 2021
  • Progressive Compressed Records: Taking a Byte out of Deep Learning Data, VLDB 2021
  • A Deep NN Favourable JPEG-based Image Compression Framework
  • A Domain-Specific Supercomputer for Training Deep NN
9 11/15 Video Streams Lecture 9
  • Kang et al., Challenges and Opportunities in DNN-based Video Analytics, CIDR 2019 (optional)
  • Kang et al., NoScope: Optimizing Neural Network Queries over Video at Scale, VLDB 2017
  • Kang et al., BlazeIt: Fast Exploratory Video Queries using Neural Networks, PVLDB 2020
  • Xarchakos et al., Video Monitoring Queries, ICDE 2020
  • Xarchakos et al., Querying for Interactions, ICDE 2021 (optional)
  • Evaluating Temporal Queries Over Video Feeds, SIGMOD 2021
10 11/22 Labelling Training Data Lecture 10
  • Snorkel: Rapid Training Data Creation with Weak Supervision, VLDB 2017
  • Snuba: Automating Weak Supervision to Label Training Data, VLDB 2019
  • Learning from Rules Generalizing Labelled Exemplars, ICLR 2020
  • Data Programming Using Continuous and Quality Guided Labelling Functions, AAAI 2020
  • Semi Supervised Data Programming with Subset Selection (optional)
11 11/29 Information Extraction Lecture 11
  • Incremental Knowledge Base Construction Using DeepDive, VLDB 2015
  • Scalable Zero Shot Entity Linking with Dense Entity Retrieval
  • Matching the Blanks: Distributional Similarity for Relation Learning
12 12/6 Query Optimization Lecture 12
  • An Overview of Query Optimization in Relational Systems, S. Chaudhuri, PODS 1998
  • Bao: Making Learned Query Optimization Practical, SIGMOD 2021
  • Steering Query Optimizers: A Practical Take on Big Data Workloads, SIGMOD 2021

Weight | Item | Minimal mark | Moderate mark | High mark
30% | Participation | Present | Talkative | Insightful comments or questions
20% | Presentations | Factually correct | Designed and delivered well | Transmits effectively key points, implications, etc.
5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws
45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented

Office Hours

I will be holding office hours online after the class. I am also available by appointment: send me an email (use subject CSC2508) and we will connect online.

Late Policy and Deliverables
There will be no late days for the project deliverables and no late days for in-class questions. Extensions may be granted in the case of a severe medical or family emergency.
The template of this website was created by HazyResearch@Stanford.