Thursday, January 8, 2009

Evolution of updating the cache in the vmfeed module

Remote packages in ECP are managed through an extension module called vmfeed. This module is a core extension module and is distributed along with the application. Remote repositories are essentially RSS feeds: vmfeed reads each feed and stores or updates every entry in the local database (the cached entries).
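To make the feed-to-cache relationship concrete, here is a minimal sketch of reading such a feed with Python's standard xml.etree.ElementTree module. The sample feed, the example.org URL, and the read_entries name are assumptions made for illustration. The element paths (channel/item, title, uuid, enclosure) are the ones used by the update_cache() method discussed in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real repository feed may carry more elements.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Repository</title>
    <description>Packages for testing</description>
    <item>
      <title>Example Package</title>
      <uuid>ABC-123</uuid>
      <enclosure url="http://example.org/pkg.tar.gz" type="application/x-gzip"/>
    </item>
  </channel>
</rss>"""

def read_entries(feed_text):
    """Return one dict per <item>, ready to be cached locally."""
    feed = ET.fromstring(feed_text)
    entries = []
    for item in feed.findall('channel/item'):
        enclosure = item.find('enclosure')
        entries.append({
            'name': item.findtext('title', default='').strip(),
            'uuid': item.findtext('uuid', default='').strip().lower(),
            # An item without an enclosure has no package URL.
            'url': enclosure.get('url') if enclosure is not None else None,
        })
    return entries

entries = read_entries(SAMPLE_FEED)
```

Each dictionary corresponds to one cached entry. In ECP the equivalent data ends up in a RepoEntry row rather than in a plain dictionary.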

Within the vmfeed extension module there is a RepoFeed class that represents an installed repository. The RepoFeed.update_cache() method is responsible for reading the feed XML and updating the database with each entry it finds. Here is what the ECP 2.1 version of the method looks like.
#ECP 2.1 version of RepoFeed.update_cache()

def update_cache(self):
    """This method will update the cache (the RepoEntry rows) with the new
    versions of all of the data in the database.
    @param self: The method class.
    @type self: L{vmfeed.model.RepoFeed}
    @return: None
    @rtype: None
    @raise None: No exceptions are raised by this method.
    @status: Stable
    @see: L{vmfeed.model.RepoFeed.validate_enclosure}"""
    if not self.cache:
        return False
    self.retrieved_on=datetime.datetime.now()
    feed=ET.fromstring(self.cache)
    e2_log('Got a good feed %s'%feed, location=__name__)
    feedname=None
    feedname=feed.find('channel/title')
    if feedname!=None and feedname.text.strip()!="":
        e2_log('Updating feed name to %s'%feedname, location=__name__)
        self.name=feedname.text.strip()
    description=None
    description=feed.find('channel/description')
    if description!=None and description.text.strip()!="":
        e2_log('Updating description to %s'%description,\
               location=__name__)
        self.description=description.text.strip()
    items=feed.findall('channel/item')
    if not len(items):
        return 0
    for i in items:
        name=i.find('title').text.strip()
        description=i.find('title').text.strip()
        try:
            description=i.find('description').text.strip()
        except:
            pass
        U=None
        U=i.find('uuid')
        if U!=None:
            U=U.text.strip().lower()
        else:
            U=gen_uuid()
        enclosure=None
        enclosures=i.findall('enclosure')

        #Only the first enclosure matters, unless it is an egg, in which
        #case we need to look at ALL of them and get the matching python
        #release version.
        #This should all get refactored actually...
        for enclosure in enclosures:
            e2_log('Enclosure is %s'%enclosure.attrib,\
                   location=__name__)
            if enclosure!=None:
                mime=self.enclosure_2_mime(enclosure)
                if not mime:
                    enclosure=None
                    continue
                elif not self.validate_enclosure(enclosure,mime):
                    enclosure=None
                    continue #Only known mime types get stored.
                else:
                    break;

        if enclosure!=None:
            #mime=self.enclosure_2_mime(enclosure)
            if not mime:
                continue #Only known mime types get stored.
            e2_log('Found an enclosure %s'%enclosure.attrib['url'],\
                   location=__name__)
            url=enclosure.attrib['url']
            url=self.normalize_url(url)
            try:
                if U:
                    re=RepoEntry.by_uuid(U)
                else:
                    re=RepoEntry.by_url(url)
            except:
                re=RepoEntry(url=url,\
                             name=name,\
                             description=description,\
                             feed=self)
            re.set(description=description,\
                   name=name,\
                   url=url,\
                   retrieved_on=datetime.datetime.now(),\
                   mime=mime,\
                   uuid=U,\
                   )
            re.sync()
            #re.retrieved_on=datetime.datetime.now()
            #re.description=description
            #re.mime=mime
            #re.uuid=U
        else:
            e2_log('No enclosures found in entry %s'%name,\
                   location=__name__)
            if enclosure:
                e2_log('Enclosure had attribs %s'%enclosure.attrib,\
                       location=__name__)
    return 1
The method reports success or failure only through its return value (False, 0, or 1), even though its docstring documents the return type as None. As a result, when the method fails, the caller learns nothing useful about what went wrong.
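One common alternative, sketched here under assumptions, is to raise an exception that states what went wrong instead of returning a status code. The FeedError class and the parse_feed function are hypothetical names used only for this illustration; they are not part of vmfeed.

```python
import xml.etree.ElementTree as ET

class FeedError(Exception):
    """Hypothetical exception carrying the reason a feed update failed."""

def parse_feed(cache):
    """Parse cached feed XML, raising FeedError with a reason on failure."""
    if not cache:
        raise FeedError('feed cache is empty')
    try:
        feed = ET.fromstring(cache)
    except ET.ParseError as err:
        raise FeedError('feed XML is malformed: %s' % err)
    if not feed.findall('channel/item'):
        raise FeedError('feed contains no items')
    return feed

# The caller now learns why the update failed, not just that it did.
try:
    parse_feed('')
except FeedError as err:
    reason = str(err)
```

With this shape, the three silent exits of the 2.1 method (empty cache, unparsable XML, no items) become distinguishable to the caller.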

The main problem the ECP development team found with this method is that it is not very cohesive. The responsibilities of this method are very broad:
  • Parse XML
  • Initialize repository entry parameters
  • Iterate through item elements (while performing XML operations)
  • Iterate through enclosure elements (while performing XML operations)
  • Check whether the repository entry exists, and create it if not
Finally, the excessive logging adds to the complexity instead of making the method easier to follow.
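As an illustration of separating one of these responsibilities, the enclosure loop could live in its own function. This is a sketch, not the ECP code: first_valid_enclosure and the KNOWN_MIME_TYPES set are assumptions, and the check on the type attribute merely stands in for the enclosure_2_mime() and validate_enclosure() calls in the method above.

```python
import xml.etree.ElementTree as ET

# Assumed set of accepted types; the real list lives in vmfeed.
KNOWN_MIME_TYPES = {'application/x-gzip', 'application/zip'}

def first_valid_enclosure(item):
    """Return (enclosure, mime) for the first enclosure with a known
    MIME type, or (None, None) when the item has no usable enclosure."""
    for enclosure in item.findall('enclosure'):
        mime = enclosure.get('type')
        if mime in KNOWN_MIME_TYPES:
            return enclosure, mime
    return None, None

item = ET.fromstring(
    '<item>'
    '<enclosure url="http://example.org/a.bin" type="application/unknown"/>'
    '<enclosure url="http://example.org/b.zip" type="application/zip"/>'
    '</item>')
enclosure, mime = first_valid_enclosure(item)
```

Because the function returns its result instead of mutating a loop variable, the caller no longer needs the enclosure=None resets or the second check on mime.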

Here is a taste of what the ECP 2.2 version of the same method will look like.
#ECP 2.2 version of RepoFeed.update_cache()

def update_cache(self, tx=None):
    """This method will update the cache (the RepoEntry rows) with the new
    versions of all of the data in the database.
    @param self: The method class.
    @type self: L{vmfeed.model.RepoFeed}
    @return: None
    @rtype: None
    @raise None: No exceptions are raised by this method.
    @status: Stable
    @see: L{vmfeed.model.RepoFeed.validate_enclosure}"""
    self.retrieved_on=datetime.datetime.now()
    feed_xml=get_element(self.cache)
    try:
        name=get_element_text(feed_xml, element='channel/title')
        self.name=name.strip()
    except AttributeError:
        pass
    try:
        desc=get_element_text(feed_xml, element='channel/description')
        self.description=desc.strip()
    except AttributeError:
        pass
    for item in VMFeedTools.get_items_xml(feed_xml):
        try:
            name=get_element_text(item, element='title')
            name=name.strip()
        except AttributeError:
            name=None
        try:
            desc=get_element_text(item, element='description')
            desc=desc.strip()
        except AttributeError:
            desc=None
        try:
            item_uuid=get_element_text(item, element='uuid')
            item_uuid=item_uuid.strip().lower()
        except AttributeError:
            item_uuid=gen_uuid()
        enclosure=VMFeedTools.get_valid_enclosure_xml(item)
        url=get_element_property(enclosure, None, 'url')
        mime=VMFeedTools.enclosure_2_mime(enclosure)
        try:
            entry_obj=VMFeedTools.get_repo_entry(item_uuid)
        except RepoEntryNotFound, e:
            e.store_traceback()
            entry_obj=RepoEntry(uuid=item_uuid,\
                                url=url,\
                                name=name,\
                                description=desc,\
                                retrieved_on=datetime.datetime.now(),\
                                mime=mime,\
                                feed=self)
        else:
            entry_obj.set(uuid=item_uuid,\
                          url=url,\
                          name=name,\
                          description=desc,\
                          retrieved_on=datetime.datetime.now(),\
                          mime=mime)
            entry_obj.sync()
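The helpers get_element() and get_element_text() are not shown in this post. The following is only a guess at their shape, assuming they wrap xml.etree.ElementTree and that a missing element surfaces as the AttributeError the method above catches. The real implementations may differ.

```python
import xml.etree.ElementTree as ET

def get_element(xml_text):
    """Parse an XML string and return its root element (assumed behavior)."""
    return ET.fromstring(xml_text)

def get_element_text(parent, element):
    """Return the text of the child at the given path.

    find() returns None when the path does not match, so the .text
    lookup raises AttributeError for a missing element.
    """
    return parent.find(element).text

feed_xml = get_element('<rss><channel><title> Demo </title></channel></rss>')
title = get_element_text(feed_xml, element='channel/title').strip()

# A missing element raises AttributeError, which the caller can handle.
try:
    get_element_text(feed_xml, element='channel/description')
    missing = False
except AttributeError:
    missing = True
```

Under this assumption, an element that exists but has no text yields None, so the caller's .strip() raises the same AttributeError and both cases fall through to the same except clause.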