TubeFiler
Learn more about TubeFiler
Introduction
The web demo shows a predefined hierarchy of 46 categories filled with test videos, each of which was automatically assigned to one or more categories. Generally speaking, given a video, the system computes a score for each category and places the clip according to these scores. This decision is made automatically by analyzing the clip's tags, title, and visual content. The result is then visualized in a browsing environment.


System Description

Categorization by Tags
Videos on YouTube are usually associated with titles and tags, which contain valuable semantic clues about a video's content. Hence, our approach puts strong weight on classification from such meta-information. For each category, a two-class linear kernel SVM classifier [Schoelkopf01] is applied to bag-of-words features. These are normalized and scaled by each word's inverse document frequency (IDF), computed with each category treated as a document, to account for the fact that words occurring in all categories are too unspecific. In previous tests, this approach proved superior to combinations with RBF kernels and unweighted features.

During training, each category of the hierarchy is interpreted as a document in the sense of the IDF measure: it is represented by a bag-of-words model containing all titles and tags of its videos according to the ground truth. During evaluation, each video's meta-information is pre-processed in the same way and scored by each subcategory's SVM.
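As an illustration of the IDF weighting with categories treated as documents, here is a minimal sketch in plain Python. The category names and tag lists are invented toy data, not the real ground truth; in the actual system, vectors like these would be fed to the per-category linear SVMs.

```python
import math
from collections import Counter

# Toy "categories as documents": each category is represented by the pooled
# tags of its videos (names and tags invented for illustration).
category_tags = {
    "sport/soccer": ["goal", "match", "soccer", "league"],
    "music/rock":   ["guitar", "concert", "band", "rock"],
    "travel/city":  ["trip", "sightseeing", "city", "tour"],
}

def idf(word, categories):
    """Inverse document frequency, with each category treated as one document.
    Words occurring in many categories get a low weight (too unspecific)."""
    n_containing = sum(1 for tags in categories.values() if word in tags)
    return math.log(len(categories) / (1 + n_containing)) + 1.0  # smoothed

def bow_vector(tags, categories):
    """Normalized bag-of-words counts, scaled per word by its IDF."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {w: (c / total) * idf(w, categories) for w, c in counts.items()}

# A test video's tags: "goal" occurs twice, so it dominates the vector.
video = ["soccer", "goal", "goal", "highlights"]
vec = bow_vector(video, category_tags)
```

The resulting sparse vector would then be scored by each subcategory's SVM; the weighting ensures that a tag unique to one category contributes more than a tag spread across the whole hierarchy.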

Categorization by Visual Features
The visual categorization is based on a concept detection system as introduced in [Ulges09]. This system uses the well-known bag-of-visual-words model [Sivic03], a successful approach in related recognition tasks such as object category recognition and concept detection. Clips are represented by keyframes, from which SIFT features [Lowe04] are extracted and matched against a visual codebook of 4,000 clusters. The resulting features are fed to category-specific binary SVMs. Finally, the SVM scores (mapped to class posteriors) are averaged over all keyframes.
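The keyframe pipeline can be sketched as follows. Everything here is toy data: a 16-word codebook over 8-dimensional random descriptors stands in for the real 4,000-cluster SIFT codebook, and a random linear function with a sigmoid stands in for a trained SVM with posterior mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook: 16 "visual words", 8-dim descriptors (real system: 4,000 / 128).
codebook = rng.normal(size=(16, 8))

def bovw_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest visual word, then histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                       # nearest cluster per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                        # L1-normalize

def clip_score(keyframes, codebook, w, b):
    """Score each keyframe with a (stand-in) linear SVM, map the score to a
    posterior with a sigmoid, and average over all keyframes of the clip."""
    posteriors = []
    for descriptors in keyframes:
        h = bovw_histogram(descriptors, codebook)
        posteriors.append(1.0 / (1.0 + np.exp(-(h @ w + b))))
    return float(np.mean(posteriors))

# A toy clip: 3 keyframes with 50 descriptors each.
keyframes = [rng.normal(size=(50, 8)) for _ in range(3)]
w, b = rng.normal(size=16), 0.0
score = clip_score(keyframes, codebook, w, b)
```

Averaging posteriors over keyframes, as described above, makes the clip-level score robust to individual keyframes that are blurry or off-topic.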

Deep-level Clustering
We refine genres further using an unsupervised clustering of the videos within a category. For this purpose, we employ PLSA [Hofmann01], which was developed in the text domain to identify latent topics in document collections. We apply PLSA to visual words instead of text words, obtaining clusters of visually similar content.
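A minimal PLSA implementation via EM might look like the sketch below. The count matrix here is invented toy data with two obvious blocks; in the system, its rows would be clips and its columns visual-word occurrence counts, and the argmax over P(topic | clip) yields the cluster assignment.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal PLSA via EM on a document-word count matrix.
    counts: (n_docs, n_words). Returns P(topic|doc) and P(word|topic)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        joint /= joint.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from count-weighted responsibilities
        weighted = counts[:, None, :] * joint                # n(d,w) * P(z|d,w)
        p_w_z = weighted.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Toy data: 4 "clips" over 4 "visual words", with two clear latent topics.
counts = np.array([[5, 5, 0, 0],
                   [4, 6, 0, 0],
                   [0, 0, 5, 5],
                   [0, 0, 6, 4]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
clusters = p_z_d.argmax(axis=1)   # hard cluster assignment per clip
```

On this block-structured toy matrix, EM separates the two groups of clips into distinct topics, mirroring how visually similar clips within a genre end up in the same cluster.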


Experiments

Genres & Dataset
A set of ca. 4,600 videos (100 for each category) was selected randomly from the entire database, structured as follows:

Hierarchy Node       Videos Downloaded   Keyframes Extracted
Sport                       487                20770
  soccer                    463                16548
  basketball                481                20460
  swimming                  490                 9524
  golf                      470                 9714
  racing                    494                16925
  snooker                   486                20576
  tennis                    485                17483
  ice hockey                477                12655
  athletics                 497                 8874
News                        472                29602
  weather                   472                11704
  interview                 480                23694
  politics                  494                19920
  finance                   487                15076
Music                       346                16050
  Rock                      390                17873
  RnB / HipHop              444                19260
  Electro                   469                19260
  Jazz / Blues              461                17406
  Classic                   444                15025
Movies                      467                25199
  Horror                    469                23541
  Sci-Fi                    468                13728
  Action                    465                12835
  Animation                 478                17743
  Crime                     467                14383
  Thriller                  427                13508
  Musical                   479                20616
  Bollywood                 413                19380
  Western                   452                16848
People / Blog               483                 6964
  Videoblog                 465                 5988
  Funny                     463                 5754
  Cute                      480                 5924
Travel                      491                22805
  City                      489                25033
  Summer & Beach            494                18090
  Adventure                 478                22949
  Winter & Snow             489                16969
TV Show                     475                22056
  game                      484                20268
  talk                      477                22480
  comedy                    485                18506
  sci-fi                    454                18367
  doctor shows              469                16328
To avoid a bias in the evaluation, the tags used for downloading a clip (e.g. "city", "sightseeing", "trip") were ignored. For the same reason, training and test data were split by upload time, simulating the effect of a system trained in the past (more precisely, before Dec 15th, 2008) being applied to future data. This resulted in a training set of 3,502 clips and a test set of 1,098 clips.
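The time-based split amounts to a simple filter on upload dates. The records below are invented; December 15th, 2008 is the cutoff named above.

```python
from datetime import date

CUTOFF = date(2008, 12, 15)  # train on the past, test on the "future"

# Hypothetical video records (ids and dates invented for illustration).
videos = [{"id": "a", "uploaded": date(2008, 11, 2)},
          {"id": "b", "uploaded": date(2009, 1, 20)}]

train = [v for v in videos if v["uploaded"] < CUTOFF]
test  = [v for v in videos if v["uploaded"] >= CUTOFF]
```

Unlike a random split, this guarantees that no information from after the cutoff leaks into training, which matches how a deployed system would actually be used.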

Results
For each test clip, the system suggests the N best categories. If the correct category is among the suggestions, we count the clip as an exact match. We also evaluate a soft match, in which supercategories (e.g. sport) count as additional hits for their subcategories (e.g. sport/soccer). Fig. 1 and Fig. 2 plot the hit rate for both measures against the number of suggestions N. Our system is also compared to a baseline based on random guessing.
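The two hit-rate measures can be sketched as follows. The scores and category names are invented toy data; categories are written as "super/sub" paths so that a soft match can be tested as a path prefix.

```python
def hit_rate(scores, truths, n, soft=False):
    """Fraction of clips whose true category is hit within the top-n suggestions.
    scores: {video: {category: score}}; truths: {video: true category}.
    With soft=True, a suggested supercategory of the truth also counts."""
    hits = 0
    for vid, truth in truths.items():
        top = sorted(scores[vid], key=scores[vid].get, reverse=True)[:n]
        for cat in top:
            if cat == truth or (soft and truth.startswith(cat + "/")):
                hits += 1
                break
    return hits / len(truths)

# Toy scores for two clips whose true category is sport/soccer.
scores = {"v1": {"sport": 0.9, "sport/soccer": 0.2, "music/rock": 0.1},
          "v2": {"sport/soccer": 0.8, "music/rock": 0.3, "sport": 0.1}}
truths = {"v1": "sport/soccer", "v2": "sport/soccer"}
```

For v1, the top-1 suggestion is the supercategory "sport": a miss under exact matching but a hit under soft matching, which is exactly the distinction the two figures plot.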



The system achieved its best results when combining textual and visual information. For example, when suggesting 5 categories, it gives a hit rate of 91.8% (Fig. 1) and hits the exact category in 80.9% of cases (Fig. 2). We also evaluated the benefits of the individual modalities (using soft matching): the tag-based approach alone gives 90.6%, clearly outperforming a purely visual categorization (45.5%). However, when a video has no meta-information associated with it, the visual analysis can still provide a valuable clue about its category.



References
[Schoelkopf01] Bernhard Schoelkopf, Alexander Smola
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
MIT Press, Cambridge, MA, USA, 2001

[Ulges09] Adrian Ulges, Markus Koch, Christian Schulze, Thomas Breuel
Learning TRECVID'08 High-Level Features from YouTube
TRECVID Workshop, 2008

[Sivic03] Josef Sivic, Andrew Zisserman
Video Google: A Text Retrieval Approach to Object Matching in Videos
International Conference on Computer Vision, 2003

[Lowe04] David Lowe
Distinctive Image Features from Scale-Invariant Keypoints
International Journal of Computer Vision, 2004

[Hofmann01] Thomas Hofmann
Unsupervised Learning by Probabilistic Latent Semantic Analysis
Machine Learning, 2001