CSC 2508 Advanced Data Management Systems

Fall 2019, Mondays 9-11am, BA 2159

Class repository: CSC2508 Drive

Description

The maturity of several Deep Learning technologies has influenced the design of data management systems and architectures and prompted a re-thinking of several of their design principles. The goal of this course is two-fold. First, we present a review of the fundamental design components of modern data management architectures, including relational and NoSQL systems. Second, we review and explore how these fundamental components can be re-designed by incorporating Deep Learning principles and techniques, and examine the resulting performance and system implications. We will also review and investigate a few novel data management application scenarios that are uniquely enabled by merging Deep Learning and query processing technologies.

Class Logistics

Class Format

  • This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course. A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions. Because of the interactive nature of the course, and space limitations, auditing is discouraged.

Prerequisites

  • Background in algorithms, databases, and machine learning is suggested.

Assignments

  • A research class project: Each of you will work on a research project and file a single submission. The project will be broken down into three assignments: (1) initial research proposal, (2) intermediate report, (3) final report and presentation.
  • Proposed class projects will be described by the instructor. Feel free to discuss your ideas with the instructor and propose your own project. However the project you propose HAS to be associated with the material in the class. This is very important and it is not up for discussion. The project should have a research component. Project ideas will be outlined in class but you are responsible for proposing your project. Some background reading is associated with each project. The project proposal (due date Oct 21) should contain the following information:
    • Topic to be addressed and the nature of the problem
    • State of the art (prior work, what remains unsolved, etc.)
    • The proposed technique to be implemented/evaluated
    • To what degree the project will repeat existing work
    • Specific, measurable goals: deliverables, and dates you expect to produce them
    Project proposals should be a couple of pages at most. A project status report is due on Nov 11. The status report should include a description of progress to date and what is expected to be accomplished by the final project presentation day.
  • Questions and Responses (QRs): For each class you will need to provide 3 questions in writing for each (mandatory) paper. Questions are individual assignments. For each class, one of you will be responsible for leading a discussion on the papers and providing answers to all questions submitted during class. Questions and responses will then be summarized in a written report, which will be submitted and shared with everyone in the class.
  • There will be no midterm or final exams.

Misc

  • Class time may be adjusted to accommodate external talks related to the class.
  • Google drive for deliverables: CSC2508 Drive

Tentative Lecture Plan (Subject to Change)


# Date Topic Lecture Materials Reading Material
1 9/9 Logistics and Review of relational technology Lecture 1
  • Background: Read Chapters 3, 4, 10, 13, 14, 15 from Ramakrishnan's book (Database Management Systems), 3rd Edition.
  • Overview of Query Optimization in Relational Systems, S. Chaudhuri, PODS 1998
  • R-Trees: A Dynamic Index Structure for Spatial Searching, A. Guttman, SIGMOD 1984
  • Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses, Müller et al., VLDB 2018
2 9/16 Overview of NoSQL Lecture 2
  • M. Stonebraker, SQL databases v. NoSQL databases, in Communications of the ACM 53(4): 10-11 (2010).
  • J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
  • Jeffrey Dean, Sanjay Ghemawat: MapReduce: a flexible data processing tool. 72-77, CACM January 2010
  • Apache Spark: A Unified Engine for Big Data Processing, Zaharia et al., CACM 2016
3 9/23 Indexing Lecture 3
  • SageDB: A Learned Database System, CIDR 2019
  • The Case for Learned Index Structures, SIGMOD 2018
  • Considerations for Handling Updates in Learned Index Structures, aiDM 2019
  • ALEX: An Updatable Adaptive Learned Index, 2019
  • A Model For Learned Bloom Filters and Related Structures
4 9/30 Query Optimization Lecture 4
  • Towards A Hands Free Query Optimizer with Deep Learning, CIDR 2019 (mandatory)
  • NEO: A Learned Query Optimizer (mandatory)
  • Plan Structured Deep Neural Network Models for Query Performance Prediction, VLDB 2019 (optional)
  • Learning State Representations for Query Optimization with Deep Reinforcement Learning, DEEM 2018 (mandatory)
  • Deep Reinforcement Learning for Join Order Enumeration, aiDM 2018 (mandatory)
  • Learning to Optimize Join Queries with Deep Reinforcement Learning (optional)
5 10/7 Selectivity Estimation Lecture 5
  • Learned Cardinalities: Estimating Correlated Joins with Deep Learning, CIDR 2019 (mandatory)
  • Multi-Attribute Selectivity Estimation Using Deep Learning (mandatory)
  • MADE: Masked Autoencoder for Distribution Estimation, ICML 2015 (mandatory)
  • An Empirical Analysis of Deep Learning for Cardinality Estimation (optional)
6 10/21 Data Exploration Lecture 6
  • Intelligent Rollups in Multidimensional Data, VLDB 2001
  • Interactive Data Exploration with Smart Drill-Down, VLDB 2015
  • NeuralCubes: Deep Representations for Visual Data Exploration
  • Visual Exploration of Machine Learning Results Using Data Cube Analysis, HILDA 2016
7 10/28 Entity Resolution Lecture 7
  • A Theory of Record Linkage
  • Magellan: Toward Building Entity Matching Systems, VLDB 2016
  • Entity Matching: How Similar is Similar, VLDB 2011
  • Entity Resolution Tutorial
  • Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018
8 11/11 ER Explanations Lecture 8
  • Why should I trust you? Explaining the predictions of any classifier, KDD 2016 (mandatory)
  • On the Robustness of Interpretability Methods (mandatory)
  • Towards Robust Interpretability with Self-Explaining Neural Networks
  • Interpreting deep learning models for entity resolution: an experience report using LIME (mandatory)
  • ExplainER: Entity Resolution Explanations
9 11/18 Data Management For Video Streams Lecture 9
  • NoScope: Optimizing Neural Network Queries over Video at Scale, VLDB 2017 (mandatory)
  • BlazeIt: Fast Exploratory Video Queries using Neural Networks (mandatory)
  • Video Monitoring Queries (mandatory)
  • SVQ: Streaming Video Queries, SIGMOD 2019 (optional)
  • Challenges and Opportunities in DNN-Based Video Analytics, CIDR 2019 (optional)
  • Focus: Querying Large Video Datasets with Low Latency and Low Cost, OSDI 2016 (optional)
10 ~11/25 RDBMS for Machine Learning Lecture 10
  • Scalable Linear Algebra on a Relational Database System, ICDE 2017
  • Scaling Machine Learning via Compressed Linear Algebra, VLDB 2016
  • SystemML: Declarative Machine Learning on Spark, VLDB 2016
  • Materialization Optimizations for Feature Selection Workloads, SIGMOD 2014
11 ~12/2 Project Presentations Lecture 11

Grading
Weight | Item | Minimal mark | Moderate mark | High mark
30% | Participation | Present | Talkative | Insightful comments or questions
20% | Presentations | Factually correct | Designed and delivered well | Transmits effectively key points, implications, etc.
5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws
45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented
Office Hours

By Appointment: BA 5240

Late Policy and Deliverables
Late submissions will not be accepted for project deliverables or for in-class questions. Extensions may be granted in the case of a severe medical or family emergency.
Credit
The template of this website was created by HazyResearch@Stanford.