CSC 2508 Advanced Data Systems

Fall 2020, Mondays 9-11am, Online

Class repository: CSC2508 Drive

Class Material Upload repository: CSC2508 Upload Drive

Description

The maturity of several Deep Learning technologies has influenced the design of data management systems and architectures and has prompted a rethinking of several of their design principles. The goal of this course is two-fold. First, it reviews the fundamental design components of modern data management architectures, including relational and NoSQL systems. Second, it explores how these fundamental components can be redesigned by incorporating Deep Learning principles and techniques, and examines the resulting performance and system implications. We will also review and investigate a few novel data management application scenarios that are uniquely enabled by merging Deep Learning and query processing technologies.

Class Logistics

Class Format

  • This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course. A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions. Because of the interactive nature of the course, and space limitations, auditing is discouraged.

Prerequisites

  • A background in algorithms, databases, and machine learning is suggested.

Assignments

  • A research class project: Each of you will work on a research project and file a single submission. The project is broken down into three assignments: (1) an initial research proposal, (2) an intermediate report, and (3) a final report and presentation.
  • Proposed class projects will be described by the instructor. Feel free to discuss your ideas with the instructor and propose your own project. However, the project you propose must be related to the material covered in class; this is very important and not up for discussion. The project should have a research component. Project ideas will be outlined in class, but you are responsible for proposing your project. Some background reading is associated with each project. The project proposal (due Oct 19) should contain the following information:
    • Topic to be addressed and the nature of the problem
    • State of the art (prior work, what remains unsolved, etc.)
    • The proposed technique to be implemented/evaluated
    • To what degree the project will repeat existing work
    • Specific, measurable goals: deliverables, and dates you expect to produce them
    Project proposals should be at most two pages. A project status report is due on Nov 16. The status report should describe progress to date and what is expected to be accomplished by the final project presentation day.
  • Questions and Responses (QRs): For each class you will need to provide 3 questions in writing for each (mandatory) paper. Questions are individual assignments. For each class, one of you will be responsible for leading a discussion on the papers and answering all questions provided during class. Questions and responses will then be summarized in a written report, which will be submitted and shared with everyone in the class.
  • There will be no midterm or final exams.

Final Project Reports

  • Final project reports are due Jan 4, 2021.
  • Reports should be structured as a research paper, containing an introduction to the problem solved and its importance; a summary of related work with citations; a section stating whether the methodology is new and proposed by you or taken from the literature, with citations for any techniques you implemented; a description of the technical developments, outlining the algorithms and describing them; a detailed experimental evaluation; and conclusions with thoughts on future work and extensions.
  • Essentially, your report should mirror the structure of the papers you have been reading during the term.

Misc

  • Class time may be adjusted to accommodate external talks related to the class.

Tentative Lecture Plan (Subject to Change)


# Date Topic Lecture Materials Reading Material
1 9/14 Logistics and Review of relational technology Lecture 1
  • Background: Review Chapters 3, 4, 10, 11, 13, 14, and 15 of Ramakrishnan's book (Database Management Systems), 3rd Edition.
  • S. Chaudhuri, An Overview of Query Optimization in Relational Systems, PODS 1998
  • A. Guttman, R-Trees: A Dynamic Index Structure for Spatial Searching, SIGMOD 1984
2 9/21 Overview of NoSQL Lecture 2
  • Davoudian et al., A Survey on NoSQL Stores, ACM Computing Surveys 2018
  • J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
  • Zaharia et al., Apache Spark: A Unified Engine for Big Data Processing, CACM 2016
3 9/28 Indexing Lecture 3
  • Kraska et al., SageDB: A Learned Database System, CIDR 2019
  • Kraska et al., The Case for Learned Index Structures, SIGMOD 2018
  • Ding et al., ALEX: An Updatable Adaptive Learned Index, SIGMOD 2020
  • M. Mitzenmacher, A Model for Learned Bloom Filters and Related Structures
4 10/5 Sorting & Nearest Neighbors Lecture 4
  • Kristo et al., The Case for a Learned Sorting Algorithm, SIGMOD 2020
  • Dong et al., Learning Space Partitions for Nearest Neighbor Search, ICLR 2020
  • Gionis et al., Similarity Search in High Dimensions via Hashing, VLDB 1999
5 10/19 Approximate Queries I Lecture 5
  • Mozafari et al., A Handbook for Building an Approximate Query Engine, TKDE 2019
  • BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data, EuroSys 2013
  • Blink and It’s Done: Interactive Queries on Very Large Data, PVLDB 5(12), 2012
  • Park et al., VerdictDB: Universalizing Approximate Query Processing, SIGMOD 2018
6 10/26 Approximate Queries II Lecture 6
  • Ma et al., Revisiting Approximate Query Engines with Machine Learning Models, SIGMOD 2019
  • Thirumuruganathan et al., Approximate Query Processing Using Deep Generative Models, ICDE 2020
  • Savva et al., ML-AQP: Query-Driven Approximate Query Processing Based on ML, arXiv
7 11/2 Entity Resolution I Lecture 7
  • Christophides et al., End-to-End Entity Resolution for Big Data: A Survey, arXiv
  • Getoor et al., Entity Resolution Tutorial, VLDB 2012
  • Mudgal et al., Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018
8 11/16 Entity Resolution II Lecture 8
  • Ebraheem et al., Distributed Representations of Tuples for Entity Resolution, PVLDB 2018
  • Brunner et al., Entity Matching with Transformer Architectures, EDBT 2020
  • Li et al., Deep Entity Matching with Pre-Trained Language Models, arXiv
  • Firmani et al., Interpreting Deep Learning Models for Entity Resolution: An Experience Report, aiDB 2019
9 11/23 Data Management For Video Streams I Lecture 9
  • Kang et al., Challenges and Opportunities in DNN-Based Video Analytics, CIDR 2019
  • Rekall: Specifying Video Events Using Compositions of Spatiotemporal Labels, arXiv
  • Poms et al., Scanner: Efficient Video Analysis at Scale, TOG 2018
  • Hsieh et al., Focus: Querying Large Video Datasets with Low Latency and Low Cost, OSDI 2018
10 ~11/30 Data Management For Video Streams II Lecture 10
  • Kang et al., NoScope: Optimizing Neural Network Queries over Video at Scale, VLDB 2017
  • Kang et al., BlazeIt: Fast Exploratory Video Queries Using Neural Networks, PVLDB 2020
  • Xarchakos et al., Video Monitoring Queries, ICDE 2020
11 ~12/7 Misc Papers Lecture 11
  • Chronis et al., External Merge Sort for Top-K Queries, SIGMOD 2020
  • Castro et al., Data Market Platforms: Trading Data Assets to Solve Data Problems, PVLDB 2020

Grading
Weight | Item | Minimal mark | Moderate mark | High mark
30% | Participation | Present | Talkative | Insightful comments or questions
20% | Presentations | Factually correct | Designed and delivered well | Transmits effectively key points, implications, etc.
5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws
45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented

Office Hours

I will hold office hours online after each class. I am also available by appointment: send me an email (use the subject CSC2508) and we will connect online.

Late Policy and Deliverables
There will be no late days for the project deliverables or for in-class questions. Extensions may be granted in the case of a severe medical or family emergency.
Credit
The template of this website was created by HazyResearch@Stanford.