• Python Scrapy: garbled Chinese when storing scraped data as UTF-8

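    This follows up on http://blog.wdoc.info/note/79.html (loading a new page and applying XPath to it inside a callback).

    Recently I hit a new problem: a page encoded in GB2312 could not be converted to UTF-8 correctly before being stored in the database.

    I searched the web at length and tried each suggested encoding conversion in turn, but none of them solved the problem.

    So I opened the page directly from the command line with the Scrapy shell:    scrapy shell http://my.test.url

    After the page loaded I printed its content to take a look. A test fragment (only part of the output) follows: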

    >>>tt = "\xe6\x98\xaf\xe4\xb8\x80\xe5\xae\xb6\xe4\xb8\x93\xe4\xb8\x9a\xe9\x94\x80\xe5\x94\xae\xe3\x80\x8a\xe9\x93\x85\xe9\x85\xb8\xe5\x85\x8d\xe7\xbb\xb4\xe6\x8a\xa4\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xe3\x80\x81\xe3\x80\x8aUPS\xe4\xb8\x93\xe7\x94\xa8\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xe3\x80\x81\xe3\x80\x8a\xe7\x9b\xb4\xe6\xb5\x81\xe5\xb1\x8f\xe4\xb8\x93\xe7\x94\xa8\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xe3\x80\x81\xe3\x80\x8aEPS\xe4\xb8\x93\xe7\x94\xa8\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xe3\x80\x81\xe3\x80\x8a\xe5\xa4\xaa\xe9\x98\xb3\xe8\x83\xbd\xe4\xb8\x93\xe7\x94\xa8\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xe3\x80\x81\xe3\x80\x8a\xe9\x93\x85\xe9\x85\xb8\xe5\x85\x8d\xe7\xbb\xb4\xe6\x8a\xa4\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xe3\x80\x81\xe3\x80\x8a\xe8\x83\xb6\xe4\xbd\x93\xe8\x93\x84\xe7\x94\xb5\xe6\xb1\xa0\xe3\x80\x8b\xef\xbc\x8c\xe5\x8c\x85\xe6\x8b\xac\xe5\xae\x89\xe8\xa3\x85\xe3\x80\x81\xe8\xb0\x83\xe8\xaf\x95\xe5\x8f\x8a\xe7\xbb\xb4\xe4\xbf\xae..."

    >>>print tt  

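    The output is garbled (the bytes themselves are valid UTF-8, so the console was presumably using a different encoding).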

    >>>print tt.decode('utf8','ignore')

    The Chinese text displays correctly ('ignore' makes the decoder skip bytes it cannot decode instead of raising an error).
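    To make the effect of 'ignore' concrete, here is a minimal sketch. It is written for Python 3 (the session above is Python 2), it uses only the first nine bytes of tt, and the helper names decode_lenient and decode_strict are made up for this illustration.

```python
# Python 3 sketch; the session above is Python 2, where a plain str holds bytes.
raw = b"\xe6\x98\xaf\xe4\xb8\x80\xe5\xae\xb6"  # first nine bytes of tt


def decode_lenient(data, encoding="utf-8"):
    """Decode bytes, dropping any byte sequence that is invalid in `encoding`."""
    return data.decode(encoding, "ignore")


def decode_strict(data, encoding="utf-8"):
    """Decode bytes, returning None instead of raising when the data is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


full = decode_lenient(raw)           # three complete characters
clipped = decode_lenient(raw[:-1])   # last character is cut short, so it is dropped
rejected = decode_strict(raw[:-1])   # strict mode refuses the same input
```

    The lenient form never fails, but it loses data silently, so it is probably best kept for inspection rather than for data that goes into a database.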

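    But inside Scrapy:

    import urllib2
    from scrapy.http import HtmlResponse

    temp = urllib2.urlopen(newurl) # send the request
    temp = temp.read() # read the raw bytes

    temp = temp.decode('utf8','ignore') # temp is now unicode; Scrapy then complains that it cannot find a default encoding...
    newresponse = HtmlResponse(newurl)
    newresponse._set_body(temp)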

    So the final solution was:

    temp = temp.decode('utf8','ignore').encode('gbk') # decode to unicode, then encode back to a GBK byte string
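    The decode-then-encode step can be wrapped in a small helper. This is only a sketch under assumptions: it is Python 3, it assumes the raw bytes are either UTF-8 or GBK, and the names to_text and transcode are hypothetical, not part of Scrapy.

```python
def to_text(raw, candidates=("utf-8", "gbk")):
    """Decode raw bytes with the first candidate encoding that accepts them."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and drop bad bytes.
    return raw.decode(candidates[0], "ignore")


def transcode(raw, target="gbk", candidates=("utf-8", "gbk")):
    """Return the same text as a byte string in the target encoding."""
    return to_text(raw, candidates).encode(target, "ignore")


sample = "\u662f\u4e00\u5bb6"                     # the first three characters of tt
from_utf8 = transcode(sample.encode("utf-8"))     # UTF-8 bytes in, GBK bytes out
from_gbk = transcode(sample.encode("gbk"))        # GBK bytes pass through unchanged
```

    Trying UTF-8 first is a heuristic: most GBK text is not valid UTF-8, but short inputs can be ambiguous. As far as I know, HtmlResponse also accepts body and encoding keyword arguments, which may remove the need to call the private _set_body at all.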

     

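    # These are rough notes kept mainly as a record; corrections are welcome.

    # If you are not familiar with loading pages through Scrapy in Python, please read http://blog.vsfor.com/note/79.html first.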

     

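  • Garbled Chinese when Scrapy inserts data through cx_Oracle

    When using Scrapy, the Python crawling framework, you often need to save the scraped data to a database.

    Here the database is Oracle. Environment setup is covered widely online, so I will not repeat it.

    When the source contains Chinese, the ##coding=utf-8 declaration at the top of the Python file is essential.

    Fixing the garbled text:

    On Linux, set the environment variable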

    NLS_LANG=AMERICAN_AMERICA.AL32UTF8

     

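    On Windows the garbled text is likewise caused by this environment variable being unset, but the fix is applied differently.

    On Windows the variable is set from within the script through the os module. Reference code: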

    ##coding=utf-8
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/topics/item-pipeline.html
    import cx_Oracle
    import os
    os.environ['NLS_LANG']="AMERICAN_AMERICA.AL32UTF8"

     

    Some sources online suggest os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8' instead. If you have the same problem, give that a try as well :)
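    The environment-variable approach can be wrapped so that a value already set outside the script is respected. A minimal sketch, in which configure_nls_lang is a hypothetical helper and not part of cx_Oracle:

```python
import os


def configure_nls_lang(value="AMERICAN_AMERICA.AL32UTF8", override=False):
    """Set NLS_LANG for the Oracle client and return the value in effect.

    An existing setting is kept unless override is True. Call this before the
    first Oracle connection is opened, since the client reads it at start-up.
    """
    if override or "NLS_LANG" not in os.environ:
        os.environ["NLS_LANG"] = value
    return os.environ["NLS_LANG"]


active = configure_nls_lang()
# import cx_Oracle   # import and connect only after the variable is in place
```

    Passing override=True forces the value, which matches what the pipeline code above does unconditionally.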

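  • Common Yii webapp configuration

    Requirements differ from developer to developer; for the options a particular setup needs, see the code comments in the framework itself.

    Without further ado:

    1. Gii module configuration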

    // uncomment the following to enable the Gii tool
        'gii'=>array(
            'class'=>'system.gii.GiiModule',
            'password'=>false,//'Enter Your Password Here', // disables password verification
            // If removed, Gii defaults to localhost only. Edit carefully to taste.
            'ipFilters'=>false,//array('127.0.0.1','::1'), // disables IP filtering
        ),

     

    2. Database configuration (reference for configuring multiple databases)
    // For the full list of db parameters, see the CDbConnection class
    $config['components']['dbmysql'] = array(
        'class'=>'CDbConnection',
        'connectionString' => 'mysql:host=localhost;dbname=testdrive',
        'emulatePrepare' => true,
        'username' => 'root',
        'password' => '',
        'charset' => 'utf8',
    );
    $config['components']['dboracle'] = array(
        'class'=>'CDbConnection',
        'connectionString' => 'oci:dbname=//localhost:1521/testdb;charset=AL32UTF8',
        'emulatePrepare' => true,
        'username' => 'scott',
        'password' => 'tiger',
        'tablePrefix' => 'TB_',
        //'schemaCachingDuration' => 3600,
        //'schemaCachingExclude' => array(),
    );

     

    3. URL configuration

    $config['components']['urlManager'] = array(
        'urlFormat'=>'path',
        'urlSuffix' => '.html', // suffix for pseudo-static URLs
        'rules'=>array(
            '<controller:\w+>/<id:\d+>'=>'<controller>/view',
            '<controller:\w+>/<action:\w+>/<id:\d+>'=>'<controller>/<action>',
            '<controller:\w+>/<action:\w+>'=>'<controller>/<action>',
        ),
    );
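    To see why the order of these rules matters, here is a rough emulation in Python. It is a sketch and not Yii code: it assumes the rules are tried top to bottom, that the first match wins, and that the .html suffix is stripped first. The names compile_rule and resolve are invented for the illustration.

```python
import re

# The three rules from the configuration above, in the same order.
RULES = [
    (r"<controller:\w+>/<id:\d+>", "<controller>/view"),
    (r"<controller:\w+>/<action:\w+>/<id:\d+>", "<controller>/<action>"),
    (r"<controller:\w+>/<action:\w+>", "<controller>/<action>"),
]
URL_SUFFIX = ".html"


def compile_rule(pattern):
    """Turn '<name:regex>' placeholders into Python named groups."""
    regex = re.sub(r"<(\w+):([^>]+)>", r"(?P<\1>\2)", pattern)
    return re.compile("^" + regex + "$")


def resolve(path):
    """Return (route, params) for the first matching rule, or None."""
    if path.endswith(URL_SUFFIX):
        path = path[: -len(URL_SUFFIX)]
    for pattern, route in RULES:
        match = compile_rule(pattern).match(path)
        if match is None:
            continue
        params = match.groupdict()
        # Placeholders used in the route are consumed; the rest remain parameters.
        for name in re.findall(r"<(\w+)>", route):
            route = route.replace("<" + name + ">", params.pop(name))
        return route, params
    return None
```

    Because \w+ also matches digits, post/12.html would satisfy the third rule as well; it reaches the view action only because the numeric rule is listed first.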

     

    4. Logging and debug output configuration

    $config['components']['log'] = array(
        'class'=>'CLogRouter',
        'routes'=>array(
            array(
                'class'=>'CFileLogRoute',
                'levels'=>'error, warning',
            ),
            // uncomment the following to show log messages on web pages
           
            array(
                'class'=>'CWebLogRoute',
                'categories'=>'system.db.CDbCommand', // show only messages about database interaction
                //'levels'=>'',
            ),
           
        ),
    );

     

    To be continued.

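  • Yii: writing relations to fetch data across multiple related tables

    A reference for the usual way to write relations.
    Example: A is the user table, B is the user-tag table, and C is the tag-information table.
    Table structure:
    table A    ID    NAME
    table B    ID    A_ID  TYPE_ID
    table C    ID    VALUE
    A.ID is related to B.A_ID (1 to n).
    B.TYPE_ID is related to C.ID (1 to 1).
    Example requirement: look up and display the full details of every tag belonging to the user named ***.
    Brief analysis: the query condition is A.NAME = *** and the goal is the detail rows in table C. First find the user's ID, then follow it into table B to get all of that user's tag IDs, then use those tag IDs to fetch the details.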
    ModelA
      'relab'=>array(self::HAS_MANY , 'ModelB' , array( 'A_ID' => 'ID' ) , 'with'=>'relbc' )    // model A fetches its rows from table B

    ModelB
      'relbc'=>array(self::BELONGS_TO , 'ModelC' , array( 'TYPE_ID'=> 'ID' ) )   // model B fetches the related row from table C

    $result = ModelA::model() -> with('relab') -> find('NAME=:NAME',array(':NAME' => '***' ));
     
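    Using the result:
    Because the relation is one-to-many, $result -> relab is normally an array of objects, so the fields of table C cannot be reached directly with $result -> relab -> relbc.
    A foreach loop over $result -> relab takes care of that; adapt it to your own case :)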
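    The shape of that loop can be sketched in plain Python. This is not Yii code: the rows are made-up sample data, and the functions relab and relbc only mimic what the two relations return (a list for HAS_MANY, a single row for BELONGS_TO).

```python
# Hypothetical sample rows standing in for tables A, B and C.
TABLE_A = [{"ID": 1, "NAME": "alice"}, {"ID": 2, "NAME": "bob"}]
TABLE_B = [
    {"ID": 10, "A_ID": 1, "TYPE_ID": 100},
    {"ID": 11, "A_ID": 1, "TYPE_ID": 101},
    {"ID": 12, "A_ID": 2, "TYPE_ID": 100},
]
TABLE_C = [{"ID": 100, "VALUE": "python"}, {"ID": 101, "VALUE": "scrapy"}]


def relab(user):
    """HAS_MANY: every B row whose A_ID equals the user's ID (always a list)."""
    return [row for row in TABLE_B if row["A_ID"] == user["ID"]]


def relbc(b_row):
    """BELONGS_TO: the single C row whose ID equals the B row's TYPE_ID."""
    return next((row for row in TABLE_C if row["ID"] == b_row["TYPE_ID"]), None)


def tag_values(name):
    """Collect the C.VALUE of every tag attached to the user called `name`."""
    user = next((row for row in TABLE_A if row["NAME"] == name), None)
    if user is None:
        return []
    values = []
    for b_row in relab(user):        # loop, because relab is a list
        tag = relbc(b_row)           # one related row per B row
        if tag is not None:
            values.append(tag["VALUE"])
    return values
```

    The same two-level structure applies in PHP: iterate over the HAS_MANY result, then read the BELONGS_TO property on each element.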

    B relates to C one-to-one, so the relation type could also be HAS_ONE, but the key mapping then needs a small matching change.
    // HAS_ONE form:  'relbc'=>array(self::HAS_ONE , 'ModelC' , array( 'ID'=> 'TYPE_ID' ) )

    // The following is an excerpt from the comments in framework/db/ar/CActiveRecord.php, for reference in actual use
    This method should be overridden to declare related objects.

    There are four types of relations that may exist between two active record objects:
    <ul>
    <li>BELONGS_TO: e.g. a member belongs to a team;</li>
    <li>HAS_ONE: e.g. a member has at most one profile;</li>
    <li>HAS_MANY: e.g. a team has many members;</li>
    <li>MANY_MANY: e.g. a member has many skills and a skill belongs to a member.</li>
    </ul>

    Besides the above relation types, a special relation called STAT is also supported
    that can be used to perform statistical query (or aggregational query).
    It retrieves the aggregational information about the related objects, such as the number
    of comments for each post, the average rating for each product, etc.

    Each kind of related objects is defined in this method as an array with the following elements:
    <pre>
    'varName'=>array('relationType', 'className', 'foreignKey', ...additional options)
    </pre>
    where 'varName' refers to the name of the variable/property that the related object(s) can
    be accessed through; 'relationType' refers to the type of the relation, which can be one of the
    following four constants: self::BELONGS_TO, self::HAS_ONE, self::HAS_MANY and self::MANY_MANY;
    'className' refers to the name of the active record class that the related object(s) is of;
    and 'foreignKey' states the foreign key that relates the two kinds of active record.
    Note, for composite foreign keys, they can be either listed together, separated by commas or specified as an array
    in format of array('key1','key2'). In case you need to specify custom PK->FK association you can define it as
    array('fk'=>'pk'). For composite keys it will be array('fk_c1'=>'pk_c1','fk_c2'=>'pk_c2').
    For foreign keys used in MANY_MANY relation, the joining table must be declared as well
    (e.g. 'join_table(fk1, fk2)').

    Additional options may be specified as name-value pairs in the rest array elements:
    <ul>
    <li>'select': string|array, a list of columns to be selected. Defaults to '', meaning all columns.
      Column names should be disambiguated if they appear in an expression (e.g. COUNT(relationName.name) AS name_count).</li>
    <li>'condition': string, the WHERE clause. Defaults to empty. Note, column references need to
      be disambiguated with prefix 'relationName.' (e.g. relationName.age&gt;20)</li>
    <li>'order': string, the ORDER BY clause. Defaults to empty. Note, column references need to
      be disambiguated with prefix 'relationName.' (e.g. relationName.age DESC)</li>
    <li>'with': string|array, a list of child related objects that should be loaded together with this object.
      Note, this is only honored by lazy loading, not eager loading.</li>
    <li>'joinType': type of join. Defaults to 'LEFT OUTER JOIN'.</li>
    <li>'alias': the alias for the table associated with this relationship.
      It defaults to null,
      meaning the table alias is the same as the relation name.</li>
    <li>'params': the parameters to be bound to the generated SQL statement.
      This should be given as an array of name-value pairs.</li>
    <li>'on': the ON clause. The condition specified here will be appended
      to the joining condition using the AND operator.</li>
    <li>'index': the name of the column whose values should be used as keys
      of the array that stores related objects. This option is only available to
      HAS_MANY and MANY_MANY relations.</li>
    <li>'scopes': scopes to apply. In case of a single scope can be used like 'scopes'=>'scopeName',
      in case of multiple scopes can be used like 'scopes'=>array('scopeName1','scopeName2').
      This option has been available since version 1.1.9.</li>
    </ul>

    The following options are available for certain relations when lazy loading:
    <ul>
    <li>'group': string, the GROUP BY clause. Defaults to empty. Note, column references need to
      be disambiguated with prefix 'relationName.' (e.g. relationName.age). This option only applies to HAS_MANY and MANY_MANY relations.</li>
    <li>'having': string, the HAVING clause. Defaults to empty. Note, column references need to
      be disambiguated with prefix 'relationName.' (e.g. relationName.age). This option only applies to HAS_MANY and MANY_MANY relations.</li>
    <li>'limit': limit of the rows to be selected. This option does not apply to BELONGS_TO relation.</li>
    <li>'offset': offset of the rows to be selected. This option does not apply to BELONGS_TO relation.</li>
    <li>'through': name of the model's relation that will be used as a bridge when getting related data. Can be set only for HAS_ONE and HAS_MANY. This option has been available since version 1.1.7.</li>
    </ul>

    Below is an example declaring related objects for 'Post' active record class:
    <pre>
    return array(
        'author'=>array(self::BELONGS_TO, 'User', 'author_id'),
        'comments'=>array(self::HAS_MANY, 'Comment', 'post_id', 'with'=>'author', 'order'=>'create_time DESC'),
        'tags'=>array(self::MANY_MANY, 'Tag', 'post_tag(post_id, tag_id)', 'order'=>'name'),
    );
    </pre>

    @return array list of related object declarations. Defaults to empty array.
  • Adding a sidebar category navigation to Magento

    Magento version: CE 1.4.1.1

    First, how the default template calls the navigation. It starts in the layout file catalog.xml with

    <reference name="top.menu">
    <block type="catalog/navigation" name="catalog.topnav" template="catalog/navigation/top.phtml"/>
    </reference>
    (around line 48)
    There is also the predefined block in page.xml,
    <block type="core/text_list" name="top.menu" as="topMenu"/>   but we need not concern ourselves with it.

    Finally, template/page/html/header.phtml makes the call
    <?php echo $this->getChildHtml('topMenu') ?>

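    When calling the navigation from a sidebar, however, getChildHtml is not needed.
    The content of the reference is what matters.
    In catalog.xml find <reference name="left"> (or <reference name="right">), then take the block above,
    <block type="catalog/navigation" name="catalog.topnav" template="catalog/navigation/top.phtml"/>
    and change it to
    <block type="catalog/navigation" name="catalog.lrnav" template="catalog/navigation/lr.phtml"/>
    Note that both the name attribute and the phtml file being called have changed.
    The header navigation and the sidebar navigation differ too much in style and are often needed at the same time, so top.phtml is not edited directly.
    lr.phtml can be modelled on top.phtml with the CSS class names changed, or written from scratch to suit your needs.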

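    After these steps, clear the cache and the category navigation should appear in the sidebar.

    To change where it is displayed, add a before or after attribute.
    For example, to show it at the very top, change the block above to:
    <block type="catalog/navigation" name="catalog.lrnav" before="-" template="catalog/navigation/lr.phtml"/>
    (If it ends up in the wrong position, check whether another block also has before="-".)

    That is all for now; I hope it helps!
    Questions and suggestions are welcome in the comments!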

©2010-2024 Jeen All Rights Reserved.Powered by emlog 京ICP备15058100号-1