wiki:BluePrint/Importer

Version 8 (modified by Nitin Rastogi, 12 years ago)

--

Integrating/Developing a framework to extract structured data from web sources with a simple query language.
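
A minimal sketch of the idea, assuming XML input and Python's standard xml.etree.ElementTree module: a small dictionary of path expressions plays the role of the query, and each matching node becomes one record. The extract function, the sample document and the field names are hypothetical illustrations and are not taken from any existing tool.

```python
import xml.etree.ElementTree as ET


def extract(xml_text, record_path, fields):
    """Return one dict per node matching record_path.

    fields maps an output field name to a path relative to each record node.
    """
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall(record_path):
        # findtext returns None when the path does not match anything.
        records.append({name: node.findtext(path) for name, path in fields.items()})
    return records


# Hypothetical sample data, used only to exercise the sketch.
SAMPLE = """<shelters>
  <shelter><name>School Hall</name><capacity>120</capacity></shelter>
  <shelter><name>Sports Centre</name><capacity>300</capacity></shelter>
</shelters>"""

# The "query": where the records are, and which sub-elements to pull out.
records = extract(SAMPLE, "./shelter", {"name": "name", "capacity": "capacity"})
print(records)
```

All values come back as strings (or None when a path is missing); type conversion would be a separate step.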

Some links that might be useful:

  • http://wiki.github.com/fizx/parsley/
  • http://developer.yahoo.com/yql/guide/
  • PDFMiner is an open-source tool (Licence) for converting PDF documents to text. Some hacking on its source code would be a good option for coding the importing tool. Spreadsheet Importer by codestasher
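  • Code snippet to extract hyperlinks from HTML docs.
    # Python 3 port of the original sgmllib-based snippet
    # (sgmllib was removed in Python 3, so html.parser is used instead).
    from html.parser import HTMLParser
    from urllib.request import urlopen
    
    class MyParser(HTMLParser):
        """Collect the href value of every <a> tag fed to the parser."""
    
        def __init__(self):
            super().__init__()
            self.hyperlinks = []
    
        def parse(self, s):
            # Feed the whole document, then flush any buffered data.
            self.feed(s)
            self.close()
    
        def handle_starttag(self, tag, attributes):
            # Called for every opening tag; only <a href="..."> is of interest.
            if tag == "a":
                for name, value in attributes:
                    if name == "href" and value:
                        self.hyperlinks.append(value)
    
        def get_hyperlinks(self):
            return self.hyperlinks
    
    # Fetch the page and decode the bytes to text before parsing.
    with urlopen("http://www.python.org") as f:
        s = f.read().decode("utf-8", errors="replace")
    
    myparser = MyParser()
    myparser.parse(s)
    
    print(myparser.get_hyperlinks())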

by codestasher

The Spreadsheet Importer will be a component of this.

It would also be good to be able to import from the following formats:

  • PDF
  • HTML (File/URL)
  • DOC
  • XML (not matching our data schema)
  • RSS
  • News feeds
  • Ushahidi
  • Incoming SMS
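
As one illustration of a format from the list above, the following is a minimal sketch of reading items from an RSS 2.0 feed with Python's standard xml.etree.ElementTree module. The parse_rss function and the sample feed are hypothetical; a real importer would also need to fetch the feed and handle namespaces, encodings and malformed XML.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    # In RSS 2.0 the items sit under the single <channel> element.
    for item in root.findall("./channel/item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


# Hypothetical sample feed, used only to exercise the sketch.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Road closed</title>
      <link>http://example.org/1</link>
      <description>Bridge damaged by flooding.</description>
    </item>
  </channel>
</rss>"""

feed_items = parse_rss(SAMPLE_FEED)
print(feed_items)
```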

Some of these formats can be parsed and imported; others may be unstructured and saved as a "News Feed".
Some of the data may be tabular, some just a single record.
Ideally, users will be able to design different templates for importing different types of data. Machine learning algorithms with (multiple?) human verification could try parsing new data formats based on previously used templates.
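
One way a user-designed template could look is sketched below, assuming tabular CSV input: the template is a plain dictionary that maps each source column header to a target field name and a converter. The import_with_template function, the template layout and the field names are assumptions made for illustration, not an existing importer API.

```python
import csv
import io


def import_with_template(csv_text, template):
    """Map each CSV row onto target fields as described by template.

    template maps a source column header to a (target_field, converter) pair.
    Empty or missing cells become None instead of being converted.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        record = {}
        for source_column, (target_field, convert) in template.items():
            raw = (row.get(source_column) or "").strip()
            record[target_field] = convert(raw) if raw else None
        records.append(record)
    return records


# Hypothetical template: source header -> (target field, converter).
HOSPITAL_TEMPLATE = {
    "Hospital Name": ("name", str),
    "Beds": ("total_beds", int),
}

# Hypothetical sample data; the second row has no bed count.
SAMPLE_CSV = "Hospital Name,Beds\nCity General,250\nNorth Clinic,\n"

records = import_with_template(SAMPLE_CSV, HOSPITAL_TEMPLATE)
print(records)
```

Keeping the template as data rather than code means a new source layout only needs a new dictionary, which is also the kind of artefact a learning step could propose for human verification.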
