Thursday, August 7, 2014

Text Classification on ICDAR

We evaluated our text detection algorithm on the ICDAR benchmark. More information can be found here: http://dag.cvc.uab.es/icdar2013competition

We ran our algorithm on Challenge 1, Task 2. Our text detection results can be found on this webpage:
https://dl.dropboxusercontent.com/u/20022261/reports/text_segmentation_benchmark_ICAR.html

The baseline excludes characters that are merged together, counting them as missed. We have no interest in reproducing that computation. Instead, we compute precision and recall as:

Precision = tp / (tp + fp)
Recall = tp / (tp + mr)

where tp is the true positive count, fp is the false positive count, and mr is the missed count. Under this measure, the accuracy of the other algorithms is as follows:


  precision   recall    f-measure
Other algorithms:
    0.9252    0.9515    0.9384
    0.8566    0.9094    0.8830
    0.8544    0.8689    0.8617
    0.7708    0.9737    0.8722
    0.8248    0.9638    0.8943
Ours:
    0.9083    0.8837    0.8960
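
For reference, here is a minimal sketch (the variable names are my own) that computes the three numbers above from the raw counts:

% tp: true positives, fp: false positives, mr: missed elements (assumed counts)
precision = tp / (tp + fp);
recall    = tp / (tp + mr);
fmeasure  = 2 * precision * recall / (precision + recall);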


Format to read the segmentation of a graphic design

For each image named filename.png, I will make a text file named filename.txt in the following format to store the detection information:

4
480 640
345  TL 71231 71232 71234 ...
509  TL 87643 87723 87724 ...
5675 GD 35443 35444 35445 ...
6578 IM 45064 45065 45066 ...

The first line indicates the number of elements. The second line indicates the height and the width of the image. Each line after that describes one element. Entries in each line are separated by a space.

The first entry in each line indicates N, the number of pixels in that segment; the second entry indicates the type of the segment (TL: Text Line, GD: Graphic Design, IM: Natural Image). This is followed by N integers, each indicating the position of a pixel in the segment. These integers are 1-based linear indices of the pixels. For example, in a 640x480 (horizontal) image, the top left corner is 1, the bottom right corner is 640*480=307200, the bottom left corner is 480, and the top right corner is 640*480-480+1=306721.
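
As an illustration, here is a minimal MATLAB sketch (the file name and variable names are my own) of how such a filename.txt could be parsed back into a label map:

fid = fopen('filename.txt', 'r');
nElements = str2double(fgetl(fid));          % first line: number of elements
sz = sscanf(fgetl(fid), '%d %d');            % second line: height and width
labels = zeros(sz(1), sz(2));                % 0 marks unlabeled background
types = cell(nElements, 1);
for k = 1:nElements
    parts = strsplit(strtrim(fgetl(fid)));   % entries are separated by spaces
    nPixels = str2double(parts{1});          % N, the number of pixels
    types{k} = parts{2};                     % TL, GD, or IM
    idx = str2double(parts(3:2+nPixels));    % N 1-based linear indices
    labels(idx) = k;                         % column-major, as described above
end
fclose(fid);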

Meeting on August 6th, 2014

After seeing the results for text detection, we discussed the following points during our meeting:

1- Aseem presented a fuzzy matching technique. In his approach, the elements of a graphic design are finalized only after a query is presented. He proposed a graph matching technique (with an MRF flavor) that matches query elements against the graphics elements. Some graphics elements could be merged or discarded during this process.

2- We discussed how to wrap up the work and drafted the following TODO list.

TODO:
1- Finalize Text detection and report it on ICDAR dataset.
2- Finalize Graphics Annotation
3- Perform training on Graphics Detection
4- Run the algorithm on a larger dataset
5- Write up a description of how our system works
6- Write up technical details of the algorithm for filing a patent.
7- Make a presentation and a poster for the last two days
8- Clean up the code so that it is more readable.
9- Put the detection pieces back in the pipeline and make a final demo.
10- Output detections in an easily readable format that Aseem could later experiment with.

Thursday, July 31, 2014

Evaluation of Text Detection

I compared the current version of our text detection with Jue's version. Because Jue's version outputs bounding boxes, I also converted our segmentations into bounding boxes to make them comparable. I count the number of true positives, false positives, and missed elements and use them to compute precision and recall.
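
Here is a rough sketch of the matching step (the 0.5 IoU threshold and the [x y w h] box format are my assumptions, not necessarily the exact criterion I used):

% det and gt hold one box per row as [x y w h] (assumed input format).
matched = false(size(gt, 1), 1);
tp = 0; fp = 0;
for i = 1:size(det, 1)
    best = 0; bestj = 0;
    for j = 1:size(gt, 1)
        ix = max(0, min(det(i,1)+det(i,3), gt(j,1)+gt(j,3)) - max(det(i,1), gt(j,1)));
        iy = max(0, min(det(i,2)+det(i,4), gt(j,2)+gt(j,4)) - max(det(i,2), gt(j,2)));
        iou = ix*iy / (det(i,3)*det(i,4) + gt(j,3)*gt(j,4) - ix*iy);
        if iou > best, best = iou; bestj = j; end
    end
    if best > 0.5 && ~matched(bestj)
        tp = tp + 1; matched(bestj) = true;  % detection explains a new ground-truth box
    else
        fp = fp + 1;                         % unmatched or duplicate detection
    end
end
mr = sum(~matched);                          % ground-truth boxes that were never found
precision = tp / (tp + fp);
recall    = tp / (tp + mr);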

I split the data into 34 training and 73 test images. The training images are smaller and faster to process because I needed images that could be run quickly. I ran the algorithms on both the training and test sets, although the split is irrelevant to Jue's text detection version. I show Jue's accuracy on my training set to help us assess the inherent difficulty of that set.

Here are the visual results:
https://dl.dropboxusercontent.com/u/20022261/reports/text_segmentation_benchmark.html

Testing

Amin's code (Testing) : Precision = 0.614, Recall = 0.692
Jue's code  (Testing) : Precision = 0.714, Recall = 0.453

Training

Amin's code (Training): Precision = 0.902, Recall = 0.830
Jue's code  (Training): Precision = 0.833, Recall = 0.600
(Jue's code has not used these images for training; this is just to compare the difficulty of the training set.)

All images including Training and Testing

Amin's code (All)     : Precision = 0.671, Recall = 0.724
Jue's code  (All)     : Precision = 0.745, Recall = 0.487

Comparison of my training and testing performance:

Amin's code (Training): Precision = 0.902, Recall = 0.830
Amin's code (Testing) : Precision = 0.614, Recall = 0.692




Wednesday, July 16, 2014

Meeting on Wednesday July 16th


We have a user interface to provide per pixel labels for text regions. We have annotated 107 images so far. Both our annotation and text detection output group text regions into text lines. 

The next step is to come up with a measure for text detection accuracy. We discussed a few measures:
  1. Ignore text group labels to get a zero-one map of pixels that indicates text pixels. Compute the intersection area, false positive area, and missed area (see the sketch after this list).
  2. Normalize according to text size
  3. In a graph framework, measure the similarity in connected component graph rather than in pixels. Measure the intersection of the edges that are on and off.
  4. To take into account text group labels as well, perform a bipartite matching between text regions and define the similarity of each group with its corresponding group according to one of the above measures.
  5. The Rand index (http://en.wikipedia.org/wiki/Rand_index) is a combination of the above techniques.
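
A minimal sketch of measure 1 (the mask variable names are my own), comparing binary text masks from the annotation and the detector:

inter  = nnz(gtMask & detMask);      % intersection area (true positive pixels)
falsep = nnz(~gtMask & detMask);     % false positive area
missed = nnz(gtMask & ~detMask);     % missed area
precision = inter / (inter + falsep);
recall    = inter / (inter + missed);
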
We also discussed strategies to group connected components:
  1. We will focus on graph partitioning for now even though having overlapping regions may later improve performance. For graph partitioning we have a number of techniques:

  2. One could merge components in a pairwise fashion.
  3. One could think of a RANSAC-style technique that scores text proposals. Those text proposals could potentially overlap.
We decided to use CMA in MATLAB to optimize algorithm parameters for now.

Friday, July 11, 2014

Meeting on Friday July 11th - Brainstorming on learning techniques

Our hand-tuned rules for merging components work relatively well. However, there is a lot of room for improvement through learning. We discussed the following learning ideas:

1- Using CMA to fine-tune algorithm parameters. (while keeping the same rules)
2- Boosting decision trees to learn actual new rules.
3- A cascade framework that goes back and forth between grouping and classification stages.
4- A framework to merge two components at a time. (This could be used alongside other ideas.)
5- A graph algorithm that can incorporate evidence that both encourages and discourages merging two components.

The first step is to generate per-pixel training data. Then we planned to use CMA to update the already-trained weights.

Aaron added a few other helpful steps:
1. Learn classifiers to predict which clusters can be merged, and use their outputs as features in clustering
2. Learn per-cluster semantic labeling classifiers, use results as features for merging
3. Semantic labeling formulations (e.g., cascades, decision tree fields). We should skip this for now since it doesn’t provide clustering.

Hand-tuned segmentation system for graphic design


Here is our current pipeline to segment out text and graphic elements from graphic designs:

0- input image



1- Cluster pixels according to color (using k-means)




2- Extract connected components
Here, n=63 different connected components are extracted.


3- Construct a graph with n nodes. Add edges according to some hand-defined criteria that include:
Pairwise distance of nodes
Similarity of stroke width
Similarity in x and y coordinates separately
Whether the nodes surround each other
Color similarity
and a combination of these.

(A piece of code to generate the graph is at the bottom of the page.)

Here is a connectivity matrix:


4- Compute connected components in this graph


5- Classify the resulting larger components according to some criteria to determine text components.
6- Remove text components as follows:

7- To determine graphic elements, follow steps 3, 4, and 5, this time tailored for graphic elements. Here is an output graphic segmentation:






Here is a piece of code to generate the graph in step 3:

g=false(nCC,nCC);
g=g| colordistfg<0.15 & yiou>0.3 & swthi>0.5 & xgap<n/20 & comp_edge==0 & comp_edge^2==1;
g=g| colordistfg<0.45 & yiou>0.3 & swthi>0.4 & xgap<n/20 & comp_edge==0 & comp_edge^2==1;
g=g| colordistfg<0.15 & yiou>0.0 & swthi>0.6 & xgap<n/40 & comp_edge==0 & comp_edge^2==1; 
g=g| colordistfg<0.05 & yiou>0.2 & swthi>0.5 & xgap<n/20 & comp_edge==0 & siblings;
g=g| colordistfg<1.20 & yiou>0.7 & swthi>0.8 & xgap<n/40 & comp_edge==0 & siblings;
g=g| colordistfg<0.55 & siblings & comp_edge==1;

% Each of the following is an nxn matrix:
% colordistfg: color distance
% yiou: intersection over union along y direction
% swthi: histogram intersection of stroke width transform
% xgap: distance along x direction
% comp_edge: whether the two components touch
% siblings: they have a common component that surrounds both
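
For step 4, the connected components of this graph can then be extracted, for example as follows (one possible way, assuming the Bioinformatics Toolbox; not necessarily the exact call we use):

gs = g | g';                                  % make the adjacency symmetric
[nGroups, groupId] = graphconncomp(sparse(double(gs)), 'Directed', false);
% groupId(i) is the index of the merged group that component i belongs to.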

Tuesday, July 8, 2014

Meeting on Tuesday, July 8th

After reporting the progress to Aaron we discussed:

  1. We need a system that is able to detect text, graphic elements and photographs.
  2. Our first priority is to get k-means and text detection running properly.
  3. To segment photo elements we can look at k-means error and natural image statistics.
  4. We need to look at graphic design grouping in a systematic way. We may need to train using hundreds or thousands of images.
  5. The audience is most interested in the applications.




Monday, July 7, 2014

Text detection pipeline


  1. Partition image pixels using k-means
  2. Extract connected components
  3. For each connected component extract the following features:
    1. Color histogram  and a measure of consistency
    2. Stroke Width Transform histogram and a measure of consistency
    3. Number of holes in a component
  4. For every pair of connected components compute the following similarity metrics:
    1. Color Similarity
    2. Stroke Width Similarity
    3. Pixel distance in x and y directions separately
    4. Graph distance between every pair of elements (the number of moves between neighboring components needed to reach component B from component A; see the sketch after this list).
  5. Merge components according to the similarity metrics. (We have hand-tuned criteria.)
  6. Classify each component according to whether it looks like text or not. (We have hand-set criteria.) Use the following features for text classification:
    1. Area (the number of pixels)
    2. Stroke Width Consistency
    3. Width/Height
    4. relative Stroke Length: Area/(Stroke_Width^2)
    5. Sparsity
  7. Display text regions together with their bounding boxes
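
Here is a rough sketch of one way to obtain the graph distance in step 4.4 (it assumes a component adjacency matrix adj and uses graphallshortestpaths from the Bioinformatics Toolbox; this is an illustration, not necessarily the code we run):

% adj(i,j) is 1 when connected components i and j are neighbors (assumed input).
G = sparse(double(adj | adj'));                       % symmetric 0/1 adjacency
hops = graphallshortestpaths(G, 'Directed', false);   % hops(a,b): moves needed to reach b from a
% hops(a,b) is Inf when the two components are not connected at all.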

Here are the outputs:
https://dl.dropboxusercontent.com/u/20022261/reports/text_segmentation_benchmark.html

Wednesday, June 25, 2014

Stroke Width Transform using Generalized Distance Transform

The original Stroke Width Transform algorithm is described in the following paper: 
Detecting Text in Natural Scenes with Stroke Width Transform

There are two major implementations available:
1- The version Jue's matlab implementation uses from bresenham_swt.m

2- A version provided by a few Cornell students: https://github.com/aperrau/DetectText

This algorithm starts with edge detection. From each edge pixel it traces a ray perpendicular to the edge orientation at that pixel and tries to hit another edge. The distance between the two edge pixels gives an estimate of the stroke width. My implementation of the Stroke Width Transform is different. Instead of ray tracing, I use the Generalized Distance Transform. The core of my code is three MATLAB lines:

D = DT(img);         % D: the generalized distance transform of the image
[~, R] = DT(-D);     % R: the assignment from the second transform pass
swt = sqrt(D(R));    % per-pixel stroke width estimate

DT is the generalized distance transform function from here. D is the distance transform and R is the assignment.

Other than its simplicity, this implementation has several advantages:
1- Its complexity is O(n), where n is the number of pixels.
2- The input can be continuous grayscale values. The algorithm takes advantage of the anti-aliasing matte around strokes to provide a more accurate stroke width estimate.
3- The output is smoother. Some holes present in the original SWT implementation are fixed here.
Here is a visual comparison of my implementation with the old techniques.
https://dl.dropboxusercontent.com/u/20022261/reports/stroke_width_transform_benchmark.html

The first column contains the original images. The second column is a segmentation using RGB k-means. Please note that the boundary matte is preserved. The third column is the output of my SWT implementation. The fourth column is the code Jue's students currently use. The fifth column is an implementation from Cornell students. The last 




Tuesday, June 24, 2014

K-means for Segmentation

Aseem suggested a simple experiment: running k-means on the RGB values of pixels. The initial segmentation outcome was promising. Two major problems were apparent in the output:

1- Boundary pixels may be assigned to wrong clusters due to anti-aliasing.
2- Two elements with the same color would fall into one cluster.
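
Here is a minimal sketch of the basic experiment described above (the file name and the value of k are my own choices):

img = im2double(imread('design.png'));            % assumes an RGB input image
X = reshape(img, [], 3);                          % one RGB triple per row
k = 6;                                            % assumed number of clusters
[idx, centers] = kmeans(X, k, 'Replicates', 3);   % cluster pixels by color
segments = reshape(idx, size(img, 1), size(img, 2));
imagesc(segments); axis image off;                % view the cluster map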

Boundary clean-up

I tried a few techniques to fix the first problem. I ended up doing the following, which gave the best results:

1- Before running k-means to determine cluster centers, I excluded edge pixels from k-means to make sure they do not form a separate cluster.
2- For pixels that fall close to a segment boundary, I perform a soft assignment. The output ends up having a clean boundary matte.

Here are the results after cleaning up the boundary matte:

https://dl.dropboxusercontent.com/u/20022261/reports/segmentation_using_kmeans.html

Second level of segmentation 

In images that are segmented using only color, separate design elements tend to fall into the same cluster because they have the same color. We discussed two approaches to fix this problem:

1- Add features to the RGB features to improve k-means clustering. One could hope that richer features would improve the clusters that k-means generates. We discussed the Stroke Width Transform, texture features (LBP and local image statistics), and spatial features. We also discussed a metric learning process based on stochastic gradient descent to learn weights for the features.

2- Separate disjoint elements of one cluster and join them up according to their appearance. We thought it is reasonable to process text elements separately using specialized features that could include Stroke Width Transform.

We decided to investigate the second idea first. I began implementing a basic version of Stroke Width transform that fits our application.



Friday, June 20, 2014

Segmentation using Edge Detection

There are certain differences between graphic design and natural images that affect the goal and hints for segmentation:




                               Graphic Design    Natural Image
Distinct segment boundaries    Frequent          Rare
Color consistency in segments  High              Low
Disjoint elements              Many              Few



We realized that a mean-shift based segmentation algorithm does not produce appropriate results on graphic designs.

We decided to try segmentation using edge detection. Two techniques are proposed for now.

1- Run an edge detector; find connected components in the edges and fill up the gaps

2- Run an edge detector; find connected components in the non-edge pixels and call each one a super-pixel (a rough sketch follows below)
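
A rough sketch of idea 2 (the choice of Canny and the file name are my assumptions):

img = rgb2gray(im2double(imread('design.png')));
edges = edge(img, 'canny');              % any edge detector could be used here
cc = bwconncomp(~edges, 4);              % connected components of non-edge pixels
sp = labelmatrix(cc);                    % each component becomes one super-pixel
imagesc(sp); axis image off;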


Update: I tried a few types of edge detectors. The problem is that these edge detectors are not designed to generate regions; they generate edges that are often disjoint.

Here are the outputs of edge detection:

https://dl.dropboxusercontent.com/u/20022261/reports/edge_detection.html

Thursday, June 19, 2014

The problem of Segmentation and Grouping for Graphic Design

After applying segmentation techniques to graphic designs, we realized a significant difference between the ways a graphic design and a natural image must be segmented. We are using natural image segmentation algorithms; however, our experiments show that some fairly simple heuristics give us better segmentations of graphic designs than a natural image segmenter does. Several important factors make a graphic design different from a natural image.

1- Graphic design elements have more consistent colors, shapes and boundaries.

2- Graphic design elements may consist of disjoint sub-elements that need to be grouped together. There is a body of psychological research on how humans group elements in an image, including:


Text segmentation

Stroke Width Transform

I compared three implementations of Stroke Width Transform.

1- The version Jue's matlab implementation uses from bresenham_swt.m
This version is designed for black and white images, so running it on grayscale images is not meaningful.

2- A version provided by a few Cornell students
http://stackoverflow.com/questions/4837124/stroke-width-transform-swt-implementation-java-c
This implementation runs quickly and produces very few false positives. However, it also misses a portion of the true letters.

3- A version from CCV library
http://libccv.org/doc/doc-swt/
I don't recommend this library. It is hard to install and does not produce proper output.

You can see the results here. Because Jue's MATLAB implementation of SWT was slow, I ran it on only a few images.
https://dl.dropboxusercontent.com/u/20022261/reports/stroke_width_transform_benchmark.html

Using bounding-boxes to improve text segmentation

I applied some heuristics to our text-box annotations to segment out the text. I first determine whether the text is dark on a light background or light on a dark background. Then I run k-means on the pixels with two clusters to get the foreground and background.
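
Here is a rough sketch of this heuristic (the box format and the border-based polarity rule are my assumptions):

% img is the design image and box = [x y w h] is one text-box annotation.
patch = im2double(img(box(2):box(2)+box(4), box(1):box(1)+box(3), :));
X = reshape(patch, [], 3);
[idx, centers] = kmeans(X, 2, 'Replicates', 3);   % two clusters: text and background
mask = reshape(idx == 1, size(patch, 1), size(patch, 2));
% Assume the cluster that dominates the box border is the background.
border = [mask(1,:) mask(end,:) mask(:,1)' mask(:,end)'];
if mean(border) > 0.5
    mask = ~mask;                                 % treat the other cluster as text
end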

Here are the results:
https://dl.dropboxusercontent.com/u/20022261/reports/text_foreground_benchmark.html

Next Steps

We planned to:
1- Use Jue's bounding boxes instead of annotation as the starting point for text segmentation.
2- Visualize the output on images
3- Use the output to determine which superpixels belong to text

Tuesday, June 17, 2014

Meeting on June 17th, 2014

We discussed the quality of superpixels here:

We brainstormed a few heuristics to improve superpixel segmentation and decided on the following ideas:

1- Get rid of low quality images
2- Remove compression noise using median filter
3- Use other superpixel algorithms such as TurboPixels
4- See if Jue's paper provides auxiliary data for text pixels
5- Heuristics to use text annotations to detect pixels belonging to the text
6- Hand annotate which superpixels must go together.


We noted that how superpixels relate to each other matters. We probably need various kinds of features, including color, shape, and stroke width transform features. We also discussed ideas to use VOC-style detectors and RCNN features, but we planned to study them later on.

Monday, June 16, 2014

Plans for Computer Vision Analysis

Text Processing

We used text detection and OCR to extract text automatically. We use a text line detection code to generate bounding boxes for text lines. Each image takes about one minute to process. For OCR we use Tesseract. Text detection results are here:

Future Plan

Here is what we discussed. After we extract text, we can try removing the text and detect the major graphics elements by parsing the image. We can over-segment the image into many super-pixels. Then, using computer vision techniques we can group and classify super-pixels into elements such as graphics, photo or background. We can either extract bounding boxes from segments or generate a heat map or a feature representation that could be later used to improve search.



Background

We decided to work on techniques involving reverse engineering graphic designs. We discussed possible goals, possible techniques and some background work.

We started by investigating ways to parse a graphic design into its elements. Parsing graphic designs can be defined in various ways. We decided to start from a good application, then define the parsing task according to the requirements of the application.

We looked at Theo, which helps beginner users produce good graphic designs. Theo is supposed to give suggestions that enable the user to make a professional-looking graphic design.

Dataset

We made a dataset of 107 designs that are simple enough to experiment with. We annotated them using Mechanical Turk. (I did the annotations in a Mechanical Turk page.) We annotated text fields and graphics fields using bounding boxes.

Pipeline

A query input design would contain a few graphics and a few text elements. Given the query, the pipeline consists of two major steps:

1- Search for relevant designs for suggestion
2- Transfer the Layout, Fonts and Colors to user Query


Nearest Neighbor Search

We performed nearest neighbor search using a few different distance measures. Some measures consider the image as a whole while some others try to perform a binary matching between pairs of elements. We used the following measures:
1- Intersection over Union
2- Aspect ratio of elements
3- Color distribution
4- Density and Moments of elements
5- Saliency map

Transfer

Our goal is to transfer layout, fonts, and colors from a database design to the query design. The elements in the query design are parsed; the extent of the graphics and text elements is known. However, database designs are raw JPG images; they have no annotations by default. We need to process database designs using MTurk and computer vision techniques. Our goal is to extract layout, fonts, and colors, as well as certain auxiliary annotations that help improve search.

We directly generate SVG files and view them in a browser.
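
As a toy, self-contained example (all names and values here are made up), an SVG file viewable in a browser can be written directly from MATLAB like this:

W = 640; H = 480;
fid = fopen('suggestion.svg', 'w');
fprintf(fid, '<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">\n', W, H);
fprintf(fid, '  <rect x="40" y="60" width="200" height="120" fill="#3366cc"/>\n');
fprintf(fid, '  <text x="40" y="220" font-family="Helvetica" font-size="24">Sample headline</text>\n');
fprintf(fid, '</svg>\n');
fclose(fid);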

Preliminary Results

Preliminary results show various limitations. One major contributing factor to the quality of results is the size of the design dataset, so we need to expand our dataset from 107 images to thousands of images. Another contributing factor is how well dataset images are processed.