Semi-Supervised Machine Learning





By Ms. Pankaja Alappanavar



This blog post introduces semi-supervised learning and gives a short literature survey of the paper "Extracting Patterns and Relations from the World Wide Web" by Sergey Brin, which employs a semi-supervised learning method.

Semi-supervised learning is used when only a small part of the data is labelled and a much larger part is unlabelled. Obtaining labelled data is often expensive, so it pays to make use of the unlabelled data as well. In semi-supervised learning, a few labelled examples are given as input to the learner, and the output the learner produces is then used to train it further. In doing so, patterns are found, and these patterns can be used to extract more data. In the paper "Extracting Patterns and Relations from the World Wide Web", Sergey Brin describes how to extract relations from the World Wide Web using a semi-supervised technique. Web data is widely distributed, comes in many different formats, and is unstructured. It needs to be presented in a structured format before it becomes useful for a wide range of applications.
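The feedback loop described above, where the learner's own confident outputs are added to its training data, is often called self-training. The following is a minimal sketch of that idea on one-dimensional toy data. The nearest-mean classifier, the `max_distance` confidence threshold, and the sample numbers are all illustrative assumptions and do not come from Brin's paper.

```python
def centroid_classifier(labelled):
    """Build a predictor that assigns a point to the class with the nearest mean."""
    totals = {}
    for x, label in labelled:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + x, count + 1)
    means = {label: total / count for label, (total, count) in totals.items()}

    def predict(x):
        # Return the closest class and the distance to its mean.
        label = min(means, key=lambda name: abs(x - means[name]))
        return label, abs(x - means[label])

    return predict


def self_train(labelled, unlabelled, max_distance=2.0):
    """Repeatedly label the points the current model is confident about."""
    labelled = list(labelled)
    remaining = list(unlabelled)
    while remaining:
        predict = centroid_classifier(labelled)
        confident = []
        for x in remaining:
            label, distance = predict(x)
            if distance <= max_distance:
                confident.append((x, label))
        if not confident:
            break  # nothing is close enough to any class mean
        labelled.extend(confident)
        accepted = {x for x, _ in confident}
        remaining = [x for x in remaining if x not in accepted]
    return labelled, remaining


# Two labelled seeds and a larger pool of unlabelled points.
seeds = [(1.0, "low"), (9.0, "high")]
pool = [2.0, 3.5, 8.0, 6.5, 5.0]
labelled, leftover = self_train(seeds, pool)
```

In this toy run, the points 3.5 and 6.5 are too far from the seeds to be labelled in the first round, but become reachable once 2.0 and 8.0 have been added. The ambiguous point 5.0 is never labelled and stays in `leftover`.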

The goal of the author was to discover information sources and then extract relevant information from them, either automatically or with minimal human intervention.

Method:

1. Initially, a small seed set of five examples was considered. The relation chosen for extraction from the web was the book-author relation.

2. Using these seeds, the World Wide Web was searched for occurrences of the books and their authors.

3. Each occurrence, i.e. a place where an author and a book title appear together, was used to form a pattern. Patterns were defined as 5-tuples: (order, url, prefix, middle, suffix). The order was a Boolean value and the other fields were strings. The order recorded which of the title and author came first in the text. The url was the URL of the document in which they occurred. The prefix was the m characters (m was 10 in the tests) preceding whichever of the author or title occurred first. The middle was the text between the author and the title. The suffix was the m characters following whichever of the two occurred second.

4. Using these patterns, new book-author pairs were discovered.

5. The new pairs were then used to discover more patterns and extract further relevant information.
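The tuple described in step 3 can be sketched as a small data structure. The names `Occurrence` and `find_occurrence`, the sample page, and the example URL are hypothetical. For brevity, this sketch only looks at the first place each string appears in the text.

```python
from typing import NamedTuple, Optional


class Occurrence(NamedTuple):
    order: bool   # True if the author appears before the title
    url: str      # URL of the document containing the pair
    prefix: str   # up to m characters before whichever comes first
    middle: str   # text between the author and the title
    suffix: str   # up to m characters after whichever comes second


def find_occurrence(text: str, url: str, author: str, title: str,
                    m: int = 10) -> Optional[Occurrence]:
    """Record the context around the first appearance of an author/title pair."""
    a, t = text.find(author), text.find(title)
    if a == -1 or t == -1:
        return None  # the pair is not on this page
    author_first = a < t
    first, second = (author, title) if author_first else (title, author)
    first_start, second_start = (a, t) if author_first else (t, a)
    first_end = first_start + len(first)
    second_end = second_start + len(second)
    if first_end > second_start:
        return None  # the two strings overlap, so there is no usable context
    return Occurrence(
        order=author_first,
        url=url,
        prefix=text[max(0, first_start - m):first_start],
        middle=text[first_end:second_start],
        suffix=text[second_end:second_end + m],
    )


# Hypothetical page fragment listing one of the seed books.
page = "<li><b>Foundation</b> by Isaac Asimov (1951)</li>"
occ = find_occurrence(page, "http://example.com/scifi.html",
                      "Isaac Asimov", "Foundation")
```

Here the title comes first, so `occ.order` is `False`, the prefix is `<li><b>` (shorter than m because the page starts there), and the middle is `</b> by `.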

The method was called DIPRE (Dual Iterative Pattern Relation Expansion). The patterns were expressed as regular expressions.
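The iteration between pairs and patterns can be sketched roughly as follows. This is a simplified illustration, not the author's implementation. It leaves out the URL component of the pattern, matches authors and titles with a generic `(.+?)` group, and requires at least two supporting occurrences per pattern; all of these are assumptions made for brevity. The function names and the two toy pages are hypothetical.

```python
import os
import re
from collections import defaultdict


def find_contexts(pages, pairs, m=10):
    """Collect (order, prefix, middle, suffix) for every known pair on every page."""
    contexts = []
    for text in pages.values():
        for author, title in pairs:
            a, t = text.find(author), text.find(title)
            if a == -1 or t == -1:
                continue
            (s1, w1), (s2, w2) = sorted([(a, author), (t, title)])
            e1, e2 = s1 + len(w1), s2 + len(w2)
            if e1 > s2:
                continue  # overlapping strings give no usable context
            contexts.append((a < t, text[max(0, s1 - m):s1],
                             text[e1:s2], text[e2:e2 + m]))
    return contexts


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def make_patterns(contexts):
    """Group contexts by order and middle, then keep the shared prefix and suffix."""
    groups = defaultdict(list)
    for order, prefix, middle, suffix in contexts:
        groups[(order, middle)].append((prefix, suffix))
    patterns = []
    for (order, middle), items in groups.items():
        if len(items) < 2:
            continue  # a single occurrence is too specific to generalise
        prefix = common_suffix([p for p, _ in items])
        suffix = os.path.commonprefix([s for _, s in items])
        if prefix and middle and suffix:
            patterns.append((order, prefix, middle, suffix))
    return patterns


def apply_patterns(pages, patterns):
    """Turn each pattern into a regular expression and extract (author, title) pairs."""
    pairs = set()
    for order, prefix, middle, suffix in patterns:
        regex = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                           + r"(.+?)" + re.escape(suffix))
        for text in pages.values():
            for first, second in regex.findall(text):
                pairs.add((first, second) if order else (second, first))
    return pairs


def dipre(pages, seeds, rounds=3):
    """Alternate between finding patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    for _ in range(rounds):
        patterns = make_patterns(find_contexts(pages, pairs))
        new_pairs = apply_patterns(pages, patterns)
        if new_pairs <= pairs:
            break  # nothing new was discovered
        pairs |= new_pairs
    return pairs


# Two hypothetical pages; only two of the three books are seeds.
pages = {
    "http://example.com/a": "<p>Best SF: <b>Foundation</b> by Isaac Asimov (1951)</p>",
    "http://example.com/b": "<p>Read <b>Dune</b> by Frank Herbert (1965) and "
                            "<b>Neuromancer</b> by William Gibson (1984).</p>",
}
seeds = {("Isaac Asimov", "Foundation"), ("Frank Herbert", "Dune")}
result = dipre(pages, seeds)
```

On these toy pages, the two seeds share the middle `</b> by `, which is enough to form one pattern. That pattern then picks up the third book, which was not in the seed set.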



Result:

The five seed books produced 199 occurrences and generated 3 patterns. Only the first two of the five books contributed to these three patterns, because both were science fiction books. A run of these patterns produced 4047 unique author-title pairs. For the final iteration, a book-related subset of the repository was used, consisting of about 156,000 documents. It produced 9938 occurrences, which generated 346 patterns. What started as a set of 5 books thus expanded to roughly 15,000 books. The author mentioned that the same tool could be applied to other domains such as music and restaurants, and that a more sophisticated version could extract people directories, product catalogues, and similar collections.