This is the course page for the 2019 edition of Object Recognition in Images and Video for the PhD in Smart Computing offered by the Universities of Florence, Pisa, and Siena.
Lecture 1: 10/05/2019 (Introduction)
Location: Aula 110 Santa Marta @ 10:15
In this first lecture I will introduce the basic problem of object recognition with some history of the field, an overview of the basic techniques and tools we will employ, and an introduction to the First Big Breakthrough that gave birth to modern object recognition – the Bag of Visual Words model. In this lecture we will trace the development of the Bag-of-Words (BoW) model through the first decade of the 21st century. We will see how advances in pooling (e.g. spatial pyramids) and feature coding (e.g. sparse coding and Fisher vectors) led to steady and significant progress in object recognition performance. We will also look at the related problem of object detection and see how descriptors like HOG and representations like Deformable Part Models (DPMs) led to significant advances in object localization as well.
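As a concrete preview of the pipeline described above, here is a minimal NumPy sketch of the BoW idea: cluster local descriptors into a visual vocabulary, then represent an image as a histogram of visual-word occurrences. The random vectors, the vocabulary size, and the toy k-means routine are illustrative stand-ins, not the implementations used in the readings:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, K, iters=10):
    """A few Lloyd iterations: enough to illustrate codebook learning."""
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

# 1. Local descriptors pooled from training images (random vectors stand in
#    for real 128-D SIFT descriptors).
train_descriptors = rng.normal(size=(500, 128))

# 2. Learn a visual vocabulary (codebook) by clustering the descriptors.
K = 32
codebook = kmeans(train_descriptors, K)

# 3. Encode one image: assign each of its descriptors to the nearest visual
#    word, then build a normalized occurrence histogram.
image_descriptors = rng.normal(size=(200, 128))
dists = ((image_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
words = dists.argmin(axis=1)
hist = np.bincount(words, minlength=K).astype(float)
hist /= hist.sum()  # the BoW representation of the image
```

Spatial pyramids and the coding methods in the readings refine steps 2 and 3; the overall structure stays the same.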
Required Reading
Visual Categorization with Bags of Keypoints, Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray. In: European Conference on Computer Vision (ECCV), 2004.
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, S Lazebnik, C Schmid, J Ponce. In: Computer Vision and Pattern Recognition (CVPR), 2006.
Improving the fisher kernel for large-scale image classification, F Perronnin, J Sánchez, T Mensink. In: European Conference on Computer Vision, 2010.
Locality-constrained linear coding for image classification, J Wang, J Yang, K Yu, F Lv, T Huang, Y Gong. In: Computer Vision and Pattern Recognition (CVPR), 2010.
Recommended Reading
Content-based image retrieval at the end of the early years, Smeulders, A. W., Worring, M., Santini, S., Gupta, A., and Jain, R. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
Chapter 1 of Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, David Marr, W. H. Freeman, 1982.
Distinctive Image Features from Scale-Invariant Keypoints, David G. Lowe. In: International Journal of Computer Vision, 2004.
Object detection with discriminatively trained part-based models, PF Felzenszwalb, RB Girshick, D McAllester, D Ramanan. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
Visual word ambiguity, JC Van Gemert, CJ Veenman, AWM Smeulders, JM Geusebroek. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
The devil is in the details: an evaluation of recent feature encoding methods. K Chatfield, VS Lempitsky, A Vedaldi, A Zisserman. In: British Machine Vision Conference, 2011.
Lecture 2: 17/05/2019 (The Shot Heard ‘Round the World)
Location: Aula 110 Santa Marta @ 10:15
In this lecture we will look at the revolutionary breakthrough that occurred in 2012: the re-introduction of neural networks into the modern discussion on object recognition. We will study some of the classic and contemporary models of Convolutional Neural Networks (CNNs) that continue to revolutionize the field. We will also look at extensions of these models to the detection problem.
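To fix ideas before the lecture, here is a minimal NumPy sketch of the building blocks CNNs stack: convolution (cross-correlation, as in deep learning frameworks), a nonlinearity, and max pooling. The image, kernel, and sizes are arbitrary toy values, not taken from any of the papers below:

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation of a single-channel image x with kernel w."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * w).sum()
    return out

def relu(x):
    """Elementwise rectified linear unit."""
    return np.maximum(x, 0)

def max_pool(x, s=2):
    """Non-overlapping s-by-s max pooling (input cropped to a multiple of s)."""
    H, W = x.shape
    H, W = H - H % s, W - W % s
    return x[:H, :W].reshape(H // s, s, W // s, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
feature_map = max_pool(relu(conv2d(image, kernel)))  # 8x8 -> 6x6 -> 3x3
```

Real networks apply many such kernels per layer, over multiple channels, with learned weights; this sketch only shows the shape of one layer's computation.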
Required Reading
ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. In: Proceedings of NIPS, 2012.
Very Deep Convolutional Networks for Large-Scale Image Recognition. Karen Simonyan and Andrew Zisserman. In: arXiv preprint arXiv:1409.1556, 2014.
Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. In: Proceedings of CVPR, 2016.
Fast R-CNN. R. Girshick. In: Proceedings of ICCV, 2015.
Recommended Reading
Gradient-based learning applied to document recognition. Y. LeCun, L. Bottou, Y. Bengio, and P Haffner. In: Proceedings of the IEEE, 1998.
Return of the Devil in the Details: Delving Deep into Convolutional Nets. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. In: Proceedings of BMVC, 2014.
Lecture 3: 24/05/2019 (The State-of-the-art)
Location: Aula 110 Santa Marta @ 10:15
In this final lecture we will leverage what we have learned about the historical development of modern object detection to study some state-of-the-art topics in object recognition. We will see the state-of-the-art detector YOLO, how to convert a CNN into a fully-convolutional network for segmentation, how CNNs can be used to learn generative models of image distributions, and how to (partially) mitigate the need for massive amounts of data via self-supervision.
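One of the topics above, converting a CNN into a fully-convolutional network, rests on a simple observation: a fully connected layer over an h-by-w feature patch is equivalent to a convolution with an h-by-w kernel, so the same weights can slide over larger inputs to produce dense score maps. A toy NumPy demonstration (all shapes and weights here are illustrative, not from the FCN paper):

```python
import numpy as np

def conv_valid(x, w):
    """Valid 2-D cross-correlation of image x with kernel w."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * w).sum()
    return out

rng = np.random.default_rng(0)

# A "fully connected" classifier head for 4x4 feature patches:
# one weight row per class, over the 16 flattened inputs.
W_fc = W_conv_source = rng.normal(size=(3, 16))

# The same weights reinterpreted as three 4x4 convolution kernels.
W_conv = W_fc.reshape(3, 4, 4)

# On a single 4x4 patch the two forms give identical scores.
patch = rng.normal(size=(4, 4))
fc_scores = W_fc @ patch.ravel()
conv_scores = np.array([conv_valid(patch, k)[0, 0] for k in W_conv])

# On a larger input the convolutional form yields a spatial map of scores,
# one per location: the basis of dense, fully-convolutional prediction.
big = rng.normal(size=(10, 10))
score_maps = np.stack([conv_valid(big, k) for k in W_conv])  # shape (3, 7, 7)
```

FCNs add learned upsampling on top of such score maps to recover full-resolution segmentations.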
Required Reading
You only look once: Unified, real-time object detection. J Redmon, S Divvala, R Girshick, A Farhadi. In: Proceedings of CVPR, 2016.
Fully convolutional networks for semantic segmentation. E Shelhamer, J Long, T Darrell. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
Unsupervised representation learning with deep convolutional generative adversarial networks. A Radford, L Metz, S Chintala. In: arXiv preprint arXiv:1511.06434, 2015.
Exploiting Unlabeled Data in CNNs by Self-supervised Learning to Rank. X. Liu, J. van de Weijer, A. D. Bagdanov. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
Recommended Reading
Anything that catches your fancy from CVPR, NIPS, ICCV, ECCV, BMVC, ICLR.
Lecture 4: 31/05/2019 (Object Recognition in Video)
Location: Aula 110 Santa Marta @ 10:15
TBD
Final Examination
There will be a final, oral examination for this course. This exam will consist of a 20-minute, reading-group style presentation on a paper selected from a recent edition of a major computer vision conference. Papers from CVPR, ECCV, ICCV, BMVC, NIPS, etc., are all fair game. Please confer with me before preparing the presentation for your final examination.
These course presentations will be scheduled approximately 3-4 weeks after the end of the course.