Thursday, July 31, 2014

Evaluation of Text Detection

I compared the current version of our text detection with Jue's version. Because Jue's version outputs bounding boxes, I also convert our segmentations into bounding boxes so the two outputs are comparable. I count the number of true positives, false positives, and missed elements, and use these counts to compute precision and recall.
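For reference, a minimal MATLAB sketch of how the two scores follow from those counts (tp, fp, and fn stand for the true positive, false positive, and missed counts; the names are mine, not from the evaluation code):

tp = nTruePositives;         % detections matched to a ground-truth element
fp = nFalsePositives;        % detections with no matching ground-truth element
fn = nMissed;                % ground-truth elements the detector missed
precision = tp / (tp + fp);  % fraction of detections that are correct
recall    = tp / (tp + fn);  % fraction of ground-truth elements found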

I split the data into 34 training and 73 test images. The training images are smaller and therefore faster to process, since I needed images I could iterate on quickly. I ran both algorithms on the training and test sets, although the split is irrelevant to Jue's text detection version; I report Jue's accuracy on my training set to help us assess the inherent difficulty of that set.

Here are the visual results:
https://dl.dropboxusercontent.com/u/20022261/reports/text_segmentation_benchmark.html

Testing

Amin's code (Testing) : Precision = 0.614, Recall = 0.692
Jue's code  (Testing) : Precision = 0.714, Recall = 0.453

Training

Amin's code (Training): Precision = 0.902, Recall = 0.830
Jue's code  (Training): Precision = 0.833, Recall = 0.600
(Jue's code did not use these images for training; this row is only to compare the difficulty of the training set.)

All images including Training and Testing

Amin's code (All)     : Precision = 0.671, Recall = 0.724
Jue's code  (All)     : Precision = 0.745, Recall = 0.487

Comparison of my training and testing performance:

Amin's code (Training): Precision = 0.902, Recall = 0.830
Amin's code (Testing) : Precision = 0.614, Recall = 0.692




Wednesday, July 16, 2014

Meeting on Wednesday, July 16th


We have a user interface for providing per-pixel labels for text regions and have annotated 107 images so far. Both our annotations and the text detection output group text regions into text lines.

The next step is to come up with a measure for text detection accuracy. We discussed a few measures:
  1. Ignore text group labels to get a zero-one map of pixels that indicates text pixels. Compute the intersection area, false positive area, and missed area (see the sketch after this list).
  2. Normalize according to text size
  3. In a graph framework, measure similarity on the connected component graph rather than on pixels, i.e., measure the overlap of the edges that are on and off.
  4. To take into account text group labels as well, perform a bipartite matching between text regions and define the similarity of each group with its corresponding group according to one of the above measures.
  5. The Rand index (http://en.wikipedia.org/wiki/Rand_index), which combines aspects of the above techniques.
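As a concrete illustration of measure 1, here is a small MATLAB sketch, assuming gt and pred are same-sized logical masks that are true at annotated and detected text pixels respectively (both names are mine):

inter  = nnz(gt & pred);    % intersection area
fpArea = nnz(~gt & pred);   % false positive area
missed = nnz(gt & ~pred);   % missed area
pixPrecision = inter / (inter + fpArea);
pixRecall    = inter / (inter + missed);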
We also discussed strategies to group connected components:
  1. We will focus on graph partitioning for now, even though allowing overlapping regions may later improve performance. For graph partitioning we have a number of techniques:
  2. One could merge components in a pairwise fashion.
  3. One could use a RANSAC-style technique that scores text proposals; those proposals could potentially overlap.
We decided to use CMA in MATLAB to optimize algorithm parameters for now.

Friday, July 11, 2014

Meeting on Friday, July 11th - Brainstorming on learning techniques

Our hand-tuned rules for merging components work relatively well. However, there is a lot of room for improvement through learning. We discussed the following learning ideas:

1- Using CMA to fine-tune algorithm parameters (while keeping the same rules).
2- Boosting decision trees to learn actual new rules.
3- A cascade framework that goes back and forth between grouping and classification stages.
4- A framework that merges two components at a time. (This could be used alongside the other ideas.)
5- A graph algorithm that can incorporate constraints that both encourage and discourage merging two components.

The first step is to generate per-pixel training data. Then we plan to use CMA to update the already trained weights.
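A rough sketch of how that optimization could be wired up, assuming Hansen's cmaes.m implementation is on the MATLAB path and detectionLoss is a hypothetical function that runs the pipeline with a given parameter vector and returns a loss (e.g., 1 - F-score) on the training set:

x0     = [0.15; 0.3; 0.5];                    % current hand-tuned thresholds (illustrative)
sigma0 = 0.1;                                 % initial step size
xbest  = cmaes('detectionLoss', x0, sigma0);  % minimizes detectionLoss(x)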

Aaron added a few other helpful steps:
1. Learn classifiers to predict which clusters can be merged, and use their outputs as features in clustering
2. Learn per-cluster semantic labeling classifiers, use results as features for merging
3. Semantic labeling formulations (e.g., cascades, decision tree fields). We should skip this for now since it doesn’t provide clustering.

Hand-tuned segmentation system for graphic design


Here is our current pipeline to segment out text and graphic elements from graphic designs:

0- input image



1- Cluster pixels according to color (using k-means)
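A minimal sketch of this step, assuming the Statistics Toolbox kmeans and an RGB input image (the file name and k are illustrative):

img = im2double(imread('design.png'));  % input design
X   = reshape(img, [], 3);              % one row per pixel, RGB
k   = 6;                                % number of color clusters
labels   = kmeans(X, k);                % cluster pixel colors
labelMap = reshape(labels, size(img,1), size(img,2));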




2- Extract connected components
Here, n = 63 different connected components are extracted.
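One way to implement this step, as a sketch assuming the Image Processing Toolbox and the labelMap from the k-means sketch above; components are extracted separately within each color cluster:

comps = {};
for c = 1:k
    cc = bwconncomp(labelMap == c);    % 8-connected components of cluster c
    comps = [comps, cc.PixelIdxList];  %#ok<AGROW> pixel index list per component
end
nCC = numel(comps);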


3- Construct a graph with n nodes. Add edges according to hand-defined criteria that include:
Pairwise distance of nodes
Similarity of stroke width
Similarity in x and y coordinates separately
Whether the nodes surround each other
Color similarity
and a combination of these.

(A piece of code to generate the graph is at the bottom of the page.)

Here is a connectivity matrix:


4- Compute connected components in this graph
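A toolbox-free sketch of this step using a simple BFS over the adjacency matrix g from step 3 (see the code at the bottom of the page); comp receives a component label per node:

g = g | g';                          % symmetrize the adjacency
comp  = zeros(nCC, 1);
nComp = 0;
for s = 1:nCC
    if comp(s) == 0                  % start a new component at node s
        nComp = nComp + 1;
        comp(s) = nComp;
        queue = s;
        while ~isempty(queue)
            v = queue(1); queue(1) = [];
            nb = find(g(v,:) & comp' == 0);  % unvisited neighbors of v
            comp(nb) = nComp;
            queue = [queue, nb];     %#ok<AGROW>
        end
    end
end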


5- Classify the resulting larger components according to some criteria to determine text components.
6- Remove text components as follows:
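One simple way to implement the removal, as a sketch: textComps is a hypothetical cell array holding the pixel index lists of the components classified as text, and removed pixels are filled with the image's per-channel median color (the actual fill strategy may differ):

mask = false(size(img,1), size(img,2));
mask(vertcat(textComps{:})) = true;     % all text pixels
fill = median(reshape(img, [], 3), 1);  % per-channel fill color
imgNoText = img;
for ch = 1:3
    plane = imgNoText(:,:,ch);
    plane(mask) = fill(ch);             % overwrite text pixels
    imgNoText(:,:,ch) = plane;
end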

7- To determine graphic elements, repeat steps 3, 4, and 5, this time tailored for graphic elements. Here is an output graphic segmentation:






Here is a piece of code to generate the graph in step 3:

g = false(nCC, nCC);
% Each rule below ORs in edges for component pairs that pass all of its
% thresholds. comp_edge^2 counts common touching neighbors, so
% comp_edge==0 & comp_edge^2==1 links pairs that do not touch themselves
% but share exactly one touching neighbor.
g = g | colordistfg<0.15 & yiou>0.3 & swthi>0.5 & xgap<n/20 & comp_edge==0 & comp_edge^2==1;
g = g | colordistfg<0.45 & yiou>0.3 & swthi>0.4 & xgap<n/20 & comp_edge==0 & comp_edge^2==1;
g = g | colordistfg<0.15 & yiou>0.0 & swthi>0.6 & xgap<n/40 & comp_edge==0 & comp_edge^2==1;
g = g | colordistfg<0.05 & yiou>0.2 & swthi>0.5 & xgap<n/20 & comp_edge==0 & siblings;
g = g | colordistfg<1.20 & yiou>0.7 & swthi>0.8 & xgap<n/40 & comp_edge==0 & siblings;
g = g | colordistfg<0.55 & siblings & comp_edge==1;

% Each of the following is an nCC-by-nCC matrix over component pairs:
% colordistfg: color distance
% yiou: intersection over union along the y direction
% swthi: histogram intersection of the stroke width transforms
% xgap: distance along the x direction
% comp_edge: whether the two components touch
% siblings: whether a common component surrounds both

Tuesday, July 8, 2014

Meeting on Tuesday, July 8th

After reporting the progress to Aaron we discussed:

  1. We need a system that is able to detect text, graphic elements and photographs.
  2. Our first priority is to get k-means and text detection running properly.
  3. To segment photo elements we can look at the k-means error and natural image statistics.
  4. We need to look at graphic design grouping in a systematic way. We may need to train using hundreds or thousands of images.
  5. The audience is most interested in the applications.




Monday, July 7, 2014

Text detection pipeline


  1. Partition image pixels using k-means
  2. Extract connected components
  3. For each connected component extract the following features:
    1. Color histogram and a measure of consistency
    2. Stroke Width Transform histogram and a measure of consistency
    3. Number of holes in a component
  4. For every pair of connected components compute the following similarity metrics:
    1. Color Similarity
    2. Stroke Width Similarity
    3. Pixel distance in x and y directions separately
    4. Graph distance between every pair of elements (i.e., how many moves between neighboring components it takes to get from component A to component B).
  5. Merge components according to the similarity metrics, using hand-tuned criteria.
  6. Classify each component as text or not text, using hand-set criteria over the following features (a sketch of computing them appears after this list):
    1. Area (the number of pixels)
    2. Stroke Width Consistency
    3. Width/Height
    4. Relative stroke length: Area/(Stroke_Width^2)
    5. Sparsity
  7. Display text regions together with their bounding boxes
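A sketch of computing the step 6 features for one merged component: pix is its pixel index list, sw a per-pixel stroke width map, and [H, W] the image size (all names are mine; the stroke width transform itself is not shown, and the consistency and sparsity definitions below are assumptions):

[ys, xs] = ind2sub([H, W], pix);
area   = numel(pix);                  % 1. area in pixels
swVals = sw(pix);
swCons = std(swVals) / mean(swVals);  % 2. stroke width consistency (one plausible definition)
bbW = max(xs) - min(xs) + 1;
bbH = max(ys) - min(ys) + 1;
aspect = bbW / bbH;                   % 3. width/height
relStrokeLen = area / mean(swVals)^2; % 4. area / stroke_width^2
sparsity = area / (bbW * bbH);        % 5. sparsity as bounding-box fill ratio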

Here are the outputs:
https://dl.dropboxusercontent.com/u/20022261/reports/text_segmentation_benchmark.html