**Syllabus of**

Techniques of Data Mining for Transportation

Course No. DB021207

Course Identification: optional

Credits/Hours: 2/32

Semester: Spring 2020

School/Department: Transportation School

Specialty: Traffic engineering, road engineering, etc.

Instructor: Professor Shuyan Chen, Transportation College, Southeast University

**Course Description:**

This course is motivated by the development of information technology. Traffic data are being collected at an ever-increasing rate, and the users of these data expect more sophisticated analysis of these large data sets. The field of data mining has developed over the last decade to address this need.

Data mining is often defined as discovering useful but hidden patterns or relationships in a database, and it is one of the most active fields in computer science. Finding patterns, trends, and outliers in these datasets, and summarizing them with simple quantitative models, is one of the grand challenges of the information age: turning data into knowledge.

Data mining programs search through datasets for hidden relationships and patterns. This approach is particularly relevant to intelligent transportation systems, where it can help traffic researchers and managers solve traffic problems. Data mining is therefore a worthwhile field of study not only for computer science students but also for transportation students, because the same techniques can be applied to many traffic problems that may arise during their future careers.

This course provides an introduction to data mining as applied to transportation systems. It covers the basic concepts of data mining as well as specific applications to transportation systems.

**Prerequisite**:

Knowledge of probability, statistics, and linear algebra at the undergraduate level; basic knowledge of traffic engineering; and basic programming skills.

**Textbook:**

Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 3rd edition, 2011.

**Reference books:**

A. Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, San Francisco: Morgan Kaufmann Publishers, 3rd ed., 2011.

B. Charu C. Aggarwal, Data Mining: The Textbook, Springer, May 2015.

C. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson, 1st Edition, 2005.

D. Christopher M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics series, Springer, 2006.

E. Required handouts will be provided by the instructor.

**Course Objectives:**

The objectives of the course are to present the basic concepts of data mining and the principles and ideas underlying its practice, including data preprocessing, instance-based learning, decision trees, support vector machines, outlier mining, and ensemble learning.

After completing this course, students will have the ability to understand the fundamental terms and concepts of data mining, and to use the methods taught in class for the analysis and processing of real transportation data.

**Tentative Course Outline:**

**Chapter 1. Introduction to data mining**

Chapter 1 introduces data mining: what data mining is, what kinds of data it is applied to, its functionality, its origins, and so on.

1.1 What is data mining?

1.2 Data mining functionality

1.3 Data Mining Techniques

1.4 Summary

**Chapter 2. Data pre-processing**

You will learn techniques for preprocessing data, including data cleaning, data integration, data reduction, and data transformation.

2.1 Why preprocess the data?

2.2 Data cleaning

2.3 Data integration

2.4 Data reduction

2.5 Data transformation

2.6 Summary
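As a small illustration of the cleaning and transformation steps listed in this chapter, the following sketch fills missing values with the mean and then applies min-max scaling. The function names and the traffic-flow values are assumptions made up for this example; they are not taken from the course materials, and other cleaning or scaling strategies are equally valid.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly traffic flows (vehicles/hour); None marks a missing reading.
flows = [120, None, 180, 240]
cleaned = fill_missing_with_mean(flows)
scaled = min_max_scale(cleaned)
print(cleaned)
print(scaled)
```

Scaling is done after cleaning here so that the minimum and maximum are computed on complete data.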

**Chapter 3. Instance-based learning**

Students will study instance-based learning (IBL), an example of a lazy learner, and learn the three components of kNN and two variants of kNN.

3.1 Overview of IBL

3.2 Components of kNN

3.3 Variants of kNN

3.4 Summary
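To make the idea of a lazy learner concrete, here is a minimal kNN classifier sketch. It assumes Euclidean distance, a fixed k, and an unweighted majority vote; these choices, the function name `knn_predict`, and the speed/occupancy samples are illustrative assumptions rather than the specific formulation used in class.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs. No model is built in
    advance: all work happens at prediction time, which is what makes kNN "lazy".
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical (speed in km/h, occupancy in %) samples, invented for illustration.
train = [
    ((85.0, 8.0), "free"),
    ((90.0, 5.0), "free"),
    ((78.0, 12.0), "free"),
    ((25.0, 45.0), "congested"),
    ((18.0, 60.0), "congested"),
    ((30.0, 40.0), "congested"),
]
print(knn_predict(train, (28.0, 42.0), k=3))  # prints "congested"
```

In practice the features would usually be normalized first, since the attribute with the largest numeric range otherwise dominates the distance.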

**Chapter 4. Decision trees**

Study decision tree representation, including how to obtain classification rules from a constructed tree; how to generate a decision tree, including how to calculate entropy and information gain; and how to address the overfitting problem by tree pruning.

4.1 Decision Tree Representation

4.2 Construct Decision Tree

4.3 Overfitting and Tree Pruning

4.4 Pros and Cons of DTs
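The entropy and information gain calculations mentioned in this chapter can be sketched as follows. This is a minimal illustration for categorical attributes, using the standard definitions (entropy in bits, gain as the entropy reduction after a split); the function names and the tiny weather/period dataset are assumptions invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the attribute at `attr_index`."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical records: (weather, period) -> traffic state.
rows = [("rain", "peak"), ("rain", "off"), ("dry", "peak"), ("dry", "off")]
labels = ["jam", "jam", "free", "free"]

print(information_gain(rows, labels, 0))  # weather separates the classes perfectly
print(information_gain(rows, labels, 1))  # period carries no information here
```

A tree-building procedure would pick the attribute with the highest gain (here, weather) as the split at the current node and recurse on each partition.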

**Chapter 5. Support vector machine**

Learn linear and non-linear support vector machines (SVMs), how to extend these algorithms to multiclass classification, and support vector regression.

5.1 Linear SVMs

5.2 Non-linear SVMs

5.3 Multiclass

5.4 Support vector regression

5.5 Summary

**Chapter 6. Outlier mining**

Learn three techniques for detecting outliers: the statistic-based method, the distance-based method, and the density-based method.

6.1 Background of Outlier Detection

6.2 Statistic-based method

6.3 Distance-based method

6.4 Density-based method

6.5 Conclusions
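As one simple instance of a statistic-based method, the sketch below flags values whose z-score exceeds a threshold. The function name, the threshold of 2.0, and the speed readings are assumptions chosen for illustration; the course may present a different statistical test.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # No variation in the data, so nothing can be flagged.
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical spot speeds (km/h) from one detector; 120 is the suspicious reading.
speeds = [62, 65, 61, 64, 63, 66, 60, 120]
print(zscore_outliers(speeds, threshold=2.0))  # prints [120]
```

Note that the outlier itself inflates the mean and standard deviation, so with small samples a conventional threshold of 3 may never be reached; this is one motivation for the distance-based and density-based methods.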

**Chapter 7. Ensemble learning**

Study several classical ensemble methods, including bagging, boosting, cross-validated committees, and random forests. Understand, in general, how ensemble members are generated and how their outputs are combined. In addition, study techniques for improving classification performance on class-imbalanced data.

7.1 General Idea of Ensemble Methods

7.2 Popular methods for ensemble

7.3 Class-Imbalanced Data

**Teaching Format:**

Classes will combine lecture and discussion. Students are expected to participate actively in class discussions. Reading will be assigned for each class, and students are expected to be prepared to answer questions.

This course also requires students to do exercises, complete a group project, and pass the final exam.

**Exercises:** Six exercises will be provided, corresponding to the classroom teaching. Each takes 2 classes to finish. Each student is expected to complete the exercises and homework assignments individually and on time. Assignments should be submitted before the due date; no extensions will be given.

Students are encouraged to complete the exercises with tools such as the WEKA software, or to write code in any programming language they are familiar with.

**Project:** Six projects are provided for students to choose from. All the projects focus on the practice of data mining techniques related to transportation. It is estimated that students will spend 10 classes on a project. Students form groups of three (four members at most) for the course project and work together on a problem of interest to them. Each group is required to give a well-prepared PowerPoint presentation to the class and to submit a final report by email by the required date. The report should be written in Word following the provided format.

Students can use the techniques covered in this course and any programming language they are familiar with, such as C++, Matlab, R, or Python.

**Grading Policy:**

The score is composed of the following parts: 30% for exercises and assignments, 30% for the course project report, and 40% for the individual presentation in class.

Reports will be evaluated by the instructor on technical soundness, format, results, depth of thinking, and so on.

- Exercises and Assignments (in class and homework): 30%

- Course Project: 30%

- Presentation: 40%