Semi-Supervised Machine Learning
By Ms. Pankaja Alappanavar
This blog introduces semi-supervised learning and gives a short literature survey of the paper "Extracting Patterns and Relations from the World Wide Web" by Sergey Brin, which employs a semi-supervised learning method.
Semi-supervised learning is used when only a small portion of the data is labelled and the rest, usually the much larger share, is unlabelled. Obtaining labelled data is often expensive, so it pays to make use of the unlabelled data as well. In semi-supervised learning, a few labelled examples are given to the learner as input, and the output the learner produces is fed back to train it further. In doing so, a pattern is found, and this pattern can be used to extract more data.
In the paper titled "Extracting Patterns and Relations from the World Wide Web", Sergey Brin describes extracting relations with this semi-supervised technique from the World Wide Web, whose data is widely distributed, present in many different formats, and unstructured. This unstructured data needs to be presented in a structured format before it becomes useful to diverse audiences.
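The general idea of feeding a learner's own output back in as training data, as described for semi-supervised learning above, can be illustrated with a minimal self-training sketch. Everything here is an assumption made for illustration and is not taken from Brin's paper: the one-dimensional data, the nearest-centroid learner, the `margin` confidence threshold, and the name `self_train` are all hypothetical.

```python
def self_train(labeled, unlabeled, margin=2.0, max_rounds=10):
    """Toy self-training on 1-D points with a nearest-centroid learner.

    labeled:   list of (x, label) pairs, with at least two distinct labels
    unlabeled: list of x values without labels
    margin:    hypothetical confidence threshold; a point is pseudo-labelled
               only if it is at least this much closer to the best centroid
               than to the runner-up
    """
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        # "Train" the learner: one centroid (mean) per label.
        by_label = {}
        for x, label in labeled:
            by_label.setdefault(label, []).append(x)
        centroids = {label: sum(xs) / len(xs) for label, xs in by_label.items()}

        confident, uncertain = [], []
        for x in remaining:
            ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
            best, runner_up = ranked[0], ranked[1]
            gap = abs(x - centroids[runner_up]) - abs(x - centroids[best])
            if gap >= margin:
                confident.append((x, best))  # pseudo-label, reused as training data
            else:
                uncertain.append(x)          # too ambiguous, leave unlabelled

        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        remaining = uncertain
    return labeled, remaining


labeled, remaining = self_train(
    [(1.0, "low"), (9.0, "high")],   # two labelled examples
    [2.0, 3.0, 8.0, 7.0, 5.0],       # five unlabelled points
)
```

In this toy run the two labelled points are enough to pseudo-label four of the five unlabelled points, while the point exactly halfway between the two groups stays unlabelled because the learner is not confident about it.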
The author's goal was to discover information sources and then extract relevant information from them, either automatically or with minimal human intervention.
Method:
1. Initially, a small seed set of five examples was chosen. The relation used as the example was the book-author relation, extracted from the web.
2. Using this seed, the World Wide Web was searched for occurrences of these books and authors.
3. Patterns were generated from these occurrences, that is, from the places where an author and a book appear together. Each pattern was defined as a 5-tuple.
4. The patterns were then run over the web to find new author-title pairs, and the process was repeated with the enlarged set.
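The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the author's implementation: the "web" is a hypothetical list of four strings, the book pairs are made up for the example, and all function names are invented here. In Brin's paper the five components of a pattern are order, URL prefix, prefix, middle, and suffix; the sketch keeps only the last three and assumes the title always appears before the author.

```python
import os.path
import re

# Hypothetical toy "web": each string stands in for one web page.
DOCUMENTS = [
    "<li><b>Foundation</b> by Isaac Asimov</li>",
    "<li><b>Dune</b> by Frank Herbert</li>",
    "<li><b>Neuromancer</b> by William Gibson</li>",
    "<p>Unrelated text about gardening.</p>",
]
SEEDS = {("Foundation", "Isaac Asimov"), ("Dune", "Frank Herbert")}


def find_occurrences(pairs, documents):
    """Record the (prefix, middle, suffix) text around each known (title, author)."""
    occurrences = []
    for title, author in pairs:
        for doc in documents:
            t = doc.find(title)
            a = doc.find(author, t + len(title)) if t != -1 else -1
            if a != -1:
                occurrences.append(
                    (doc[:t], doc[t + len(title):a], doc[a + len(author):])
                )
    return occurrences


def generate_patterns(occurrences):
    """Group occurrences by their middle text and generalise the context."""
    by_middle = {}
    for prefix, middle, suffix in occurrences:
        by_middle.setdefault(middle, []).append((prefix, suffix))
    patterns = []
    for middle, contexts in by_middle.items():
        if len(contexts) < 2:
            continue  # a single occurrence is not enough evidence for a pattern
        # Longest common suffix of the prefixes (computed on reversed strings).
        prefix = os.path.commonprefix([p[::-1] for p, _ in contexts])[::-1]
        # Longest common prefix of the suffixes.
        suffix = os.path.commonprefix([s for _, s in contexts])
        if prefix and suffix:  # crude guard against overly general patterns
            patterns.append((prefix, middle, suffix))
    return patterns


def apply_pattern(pattern, documents):
    """Extract every (title, author) pair that matches the pattern."""
    prefix, middle, suffix = pattern
    regex = re.compile(
        re.escape(prefix) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(suffix)
    )
    return {match for doc in documents for match in regex.findall(doc)}


def dipre(seeds, documents, max_iterations=3):
    """Bootstrap: pairs -> occurrences -> patterns -> more pairs, until stable."""
    pairs = set(seeds)
    for _ in range(max_iterations):
        patterns = generate_patterns(find_occurrences(pairs, documents))
        new_pairs = set()
        for pattern in patterns:
            new_pairs |= apply_pattern(pattern, documents)
        if new_pairs <= pairs:
            break  # no new pairs were discovered
        pairs |= new_pairs
    return pairs


found = dipre(SEEDS, DOCUMENTS)
print(sorted(found))
```

Starting from the two seed pairs, the sketch learns one pattern from their shared context and uses it to pick up the third book, which was never given as a seed. A real system would also need the URL prefix, the author/title order, and a check that patterns are specific enough to avoid false matches.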
Result:
The first 5 seeds produced 199 occurrences and generated 3 patterns. All three patterns came from the first two books, because both were science fiction. A run of these patterns produced 4047 unique author-title pairs. For the final iteration, a repository of 156,000 book-related documents was used, which produced 9938 occurrences and in turn generated 346 patterns. Starting from an initial set of just 5 books, the list expanded to about 15,000 books. The author notes that the same tool could be applied to other domains such as music and restaurants, and that a more sophisticated version could extract people directories, product catalogues, and the like.