Image, Video and Multimedia Systems - Stanford University


July 2014: We demonstrate a novel multimedia system that continuously indexes videos and enables real-time search using images, with a broad range of potential applications. Television shows are recorded and indexed continuously, and iconic images from recent events are discovered automatically. Users can query with an uploaded image or with an image found on the web. When a result is served, the user can play the video clip from the beginning or from the point in time where the retrieved image was found.

July 2014: Art++, short for Augmenting Art with Technology, aims to improve the experience of visitors in a museum gallery by proposing a new way of delivering information to them. Using augmented reality, Art++ will offer viewers an immersive and interactive learning experience by overlaying content directly on the objects through the viewfinder of a smartphone or tablet device. This project is sponsored by the Brown Institute for Media Innovation.

November 2013: Mobile phones, equipped with powerful processors, high-resolution cameras, and sharp color displays, enable a new class of visual search applications. While matching based on bags of image features works well for most visual search applications, its success in retrieving images with text has been limited. Text in images is noticeable and descriptive, yet its repetitive structure causes problems for retrieval pipelines based on image features. One way around this problem is to perform character recognition. However, the wide variations observed in images taken with mobile phones make it difficult for generic character recognition engines to recognize the characters reliably.

In this talk, we introduce a new image retrieval framework that uses visual text information for visual search and aims to solve these problems. We first describe how we locate visual text in images with background clutter, using a bottom-up text detection algorithm that extracts maximally stable extremal regions (MSERs) as character candidates. From these candidates, text lines and words are formed. We then present a Word Histogram of Oriented Gradients (Word-HOG) descriptor that is generated from the detected word patches. The Word-HOG descriptor achieves better word matching performance than state-of-the-art algorithms and character recognition engines, and it can be efficiently compressed for data transmission. Finally, we describe the image retrieval framework that uses visual text and show that it works for scenarios with different types of visual text images. We also explain a database reduction scheme based on random sampling that enables large-scale image retrieval.
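To make the descriptor idea concrete, here is a minimal sketch of a HOG-style descriptor computed over a grayscale word patch: the patch is divided into a grid of cells, and each cell contributes a histogram of gradient orientations weighted by gradient magnitude. The cell grid, bin count, and function name are illustrative choices for this sketch, not the parameters of the Word-HOG descriptor described above.

```python
import numpy as np

def word_hog(patch, cells=(2, 8), bins=9):
    """Sketch of a HOG-style descriptor for a grayscale word patch.

    Divides the patch into a cells[0] x cells[1] grid and builds one
    orientation histogram per cell, weighted by gradient magnitude.
    Parameters here are illustrative, not the published Word-HOG ones.
    """
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                   # gradients along rows, cols
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation [0, pi)
    h, w = patch.shape
    ch, cw = h // cells[0], w // cells[1]
    desc = []
    for i in range(cells[0]):
        for j in range(cells[1]):
            m = mag[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            a = ang[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            desc.append(hist)
    desc = np.concatenate(desc)
    n = np.linalg.norm(desc)                      # L2-normalize the descriptor
    return desc / n if n > 0 else desc
```

Because the descriptor is a short vector of non-negative histogram counts, it lends itself to the kind of quantization and entropy coding that makes compression for transmission practical.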

November 2013: Many mobile visual search (MVS) systems compare query images captured by the mobile device's camera against a database of labeled images to recognize objects seen in the device's viewfinder. Practical MVS systems require a fast response to provide an interactive and compelling user experience. Thus, the recognition pipeline must be extremely efficient and reliable. Congestion on a server or slow transmission of the query data over a wireless network could severely degrade the user experience.

We show how a memory-efficient database stored entirely on a mobile device can enable on-device queries that achieve a fast response. The image signatures stored in the database must be compact enough to fit in the device's small memory, capable of fast comparisons across a large database, and robust against large geometric and photometric visual distortions. We first develop two methods for efficiently compressing a database constructed from feature histograms; the popular vocabulary tree is included in this framework. Our methods reduce database memory usage by 4-5x without any loss in matching accuracy and support fast decoding. We then develop a third database representation, based on feature residuals, that is even more compact. The residual-based database reduces memory usage by 12-14x, requires only a small codebook, and performs image matching directly in the compressed domain.
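As a rough illustration of the residual idea, the sketch below quantizes each feature descriptor to its nearest codeword in a small codebook and keeps only the sign bits of the residual, so two images can be compared directly in the compressed domain with Hamming distances. All names, the codebook size, and the scoring rule are assumptions for this sketch, not the system's actual representation.

```python
import numpy as np

def residual_signature(descriptors, codebook):
    """Compress descriptors to (codeword index, residual sign bits)."""
    # Assign each descriptor to its nearest codeword in the small codebook.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    residual = descriptors - codebook[idx]
    bits = residual > 0                       # keep 1 sign bit per dimension
    return idx, bits

def hamming_score(sig_a, sig_b):
    """Match two signatures directly in the compressed domain."""
    idx_a, bits_a = sig_a
    idx_b, bits_b = sig_b
    dims = bits_a.shape[1]
    score = 0
    for ci, ba in zip(idx_a, bits_a):
        same = idx_b == ci                    # candidates in the same cell
        if same.any():
            dists = (bits_b[same] != ba).sum(axis=1)   # Hamming distances
            score += dims - dists.min()       # similarity of the best match
    return score

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))           # tiny illustrative codebook
img = rng.normal(size=(50, 8))                # 50 toy 8-d descriptors
sig = residual_signature(img, codebook)
```

Storing an index into a 16-entry codebook plus 8 sign bits costs a few bytes per feature, which is what makes it feasible to keep a large database resident on the device.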

With our compact database stored on a mobile device, we have implemented a practical MVS system that can recognize media covers, book spines, outdoor landmarks, artwork, and video frames out of a large database in less than 1 second per query. Our system uses motion analysis on the device to automatically infer user interest, select high-quality query frames, and update the pose of recognized objects for accurate augmentation. We also demonstrate how a continuous stream of compact residual-based signatures enables low-bitrate query expansion to a remote server when network conditions are favorable. The query expansion improves image matching during the current query and updates the local on-device database to benefit future queries.

May 2013: We demonstrate EigenNews, a personalized television news system. Upon visiting the EigenNews website, a user is shown a variety of news videos which have been automatically selected based on her individual preferences. These videos are extracted from 16 continually recorded television programs using a multimodal segmentation algorithm. Relevant metadata for each video are generated by linking videos to online news articles. Selected news videos can be watched in three different layouts and on various devices (televisions, computers, mobile devices).

June 2011: ClassX is an interactive lecture streaming system developed by the Image, Video, and Multimedia Systems (IVMS) research group at Stanford University. ClassX significantly reduces the cost of capturing and publishing educational videos, since the only cost incurred is the cost of the recording equipment (a consumer grade HD camcorder, a wireless microphone, and a tripod).

ClassX offers high resolution video quality over bitrates typical of standard definition video using a technology called "region-of-interest video streaming". ClassX also offers numerous other forms of user interaction with content and with other users. The synchronization of slides with video and the association of video content with keywords make it easier for the user to search the video content. The forum created for each published session provides a channel for discussions among the users and instructors.

Additionally, ClassX provides a means of collecting user performance analytics that are of interest to the instructor. The quiz system and user activity monitoring system provide the instructor with information that may be used to gauge student performance.

May 2010: We present a mobile product recognition system for the camera-phone. By snapping a picture of a product with a camera-phone, the user can retrieve online information about the product. The product is recognized by an image-based retrieval system located on a remote server. Our database currently comprises more than one million entries, primarily products packaged in rigid boxes with printed labels, such as CDs, DVDs, and books. We extract low bit-rate descriptors from the query image and compress the locations of the descriptors using location histogram coding on the camera-phone. We transmit the compressed query features, instead of a query image, to reduce the transmission delay. We use inverted index compression and fast geometric re-ranking on our database to provide a low-delay image recognition response for large-scale databases. Experimental timing results on different parts of the mobile product recognition system are reported in this work.
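To illustrate the location-coding step, here is a minimal sketch of the core idea behind location histogram coding: instead of transmitting raw (x, y) keypoint coordinates, the encoder quantizes locations into a coarse grid and sends the cell occupancy counts, which compress much better. The grid size and function name are illustrative assumptions, not the system's actual parameters.

```python
import numpy as np

def location_histogram(points, img_size, grid=(16, 16)):
    """Quantize keypoint locations into a coarse grid histogram.

    points   : (N, 2) array of (x, y) pixel coordinates
    img_size : (width, height) of the query image
    grid     : number of histogram cells per axis (illustrative choice)
    """
    xs, ys = points[:, 0], points[:, 1]
    gx = np.clip((xs / img_size[0] * grid[0]).astype(int), 0, grid[0] - 1)
    gy = np.clip((ys / img_size[1] * grid[1]).astype(int), 0, grid[1] - 1)
    hist = np.zeros(grid, dtype=np.int32)
    np.add.at(hist, (gx, gy), 1)          # count keypoints per grid cell
    return hist
```

The resulting sparse count array is then a natural input to entropy coding, and the coarse cell positions still carry enough geometry for a server-side consistency check such as geometric re-ranking.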

March 2009: Aditya Mavlankar recently built a demonstrator at Deutsche Telekom Research Laboratories in Berlin, Germany. The demonstrator shows interactive viewing of a soccer game. The view of the entire soccer playfield was obtained by stitching views from multiple cameras. The RoI can be chosen to conveniently focus on a part of the playfield. Also provided is an automatic mode in which the system tracks the ball and chooses the RoI. The automatic mode relieves the user of the navigation burden, although the user can still change the zoom factor.

©2013-2014 IVMS, Stanford University