« Home | Better Password Management » | Some Things I Currently Think Are Cool » | VMware update » | Trustix + VMware Server » | IE Javascript problem? » | Scott Guthrie's ASP.NET + Atlas Tutorial » | Google Trends » | Microsoft Virtual Labs » | SyncToy for Windows XP » | Another link drop » 

Sunday, August 20, 2006 

Importing the Geo-Names database into MsSql Server

Update (26-Jul-2009): This article is now a little outdated. The table now has some extra columns. Have a look at Loading GeoNames Data Into SQL Server 2008 by EdKatibah for an up-to-date approach. I've successfully imported the newest allCountries.txt with the following steps:
  1. UNIX2DOS to update the line endings of AllCountries.txt
  2. Same nant script below to convert AllCountries.txt to UNICODE
  3. CREATE TABLE now looks like this:
    CREATE TABLE [GeoNames] (
        [GeonameID] [int] NOT NULL ,
        [Name] [nvarchar] (200) NOT NULL ,
        [AnsiName] [varchar] (200) NULL ,
        [AlternateNames] [nvarchar] (MAX) NULL ,
        [Latitude] [float] NOT NULL ,
        [Longitude] [float] NOT NULL ,
        [FeatureClass] [char] (1) NULL ,
        [FeatureCode] [varchar] (10) NULL ,
        [CountryCode] [char] (2) NULL ,
        [CC2] [varchar] (60) NULL ,
        [Admin1Code] [varchar] (20) NULL ,
        [Admin2Code] [varchar] (80) NULL ,
        [Admin3Code] [varchar] (20) NULL ,
        [Admin4Code] [varchar] (20) NULL ,
        [Population] [bigint] NOT NULL ,
        [Elevation] [int] NULL ,
        [GTopo30] [int] NULL ,
        [Timezone] [varchar] (50) NULL ,
        [ModificationDate] [datetime] NULL 
    ) 
  4. EdKatibah's BULK INSERT works nicely:
    BULK INSERT GeoNames
    FROM 'D:\Developer\Geonames\allCountries_unicode.txt'
    WITH(
       DATAFILETYPE = 'widechar',
       FIELDTERMINATOR = '\t',
       ROWTERMINATOR = '\n'
    )
    GO 
Using BULK INSERT to load the data is MUCH nicer than using BCP. As BULK INSERT will actually tell you where and WHY and import failed. I was originally trying to do this with BCP. I had DECIMAL types for latitude and longitude - which resulted in an infuriating error because a handful of rows have scientific notation representation.

Geo-Names "provides free geo-data such as geographical names and postal codes. The database contains over 6 million entries for geographical names whereof 2.2 million cities and villages". The offer their content via a webservice, and as a database download.

I'm using a subset of this data for my 'learn ASP.NET 2.0' project, so I've imported the database export into my SQL Server. Here's the steps I followed:

  • Create a table that looks like this (note the nvarchar columns for the extendend characters):
    
    CREATE TABLE [CityNames] (
        [GeonameID] [int] NOT NULL ,
        [Name] [nvarchar] (200) NOT NULL ,
        [AnsiName] [varchar] (200) NOT NULL ,
        [Latitude] [decimal](18, 15) NOT NULL ,
        [Longitude] [decimal](18, 15) NOT NULL ,
        [FeatureCode] [varchar] (10) NULL ,
        [CountryCode] [char] (2) NULL ,
        [Admin1Code] [varchar] (20) NULL ,
        [Population] [int] NOT NULL 
    ) 
    
    
  • The database export is a UTF8 file - MSSQL DTS import wants a UTF16 file. Easy to convert a file by opening it in Textpad/VS.Net and using 'Save As' to change the encoding. Not so easy with the 500meg 'allCountries' file. The nant copy task works well for this:
    <target name="build">
        <copy file="allCountries.txt" 
        todir="Converted" inputencoding="UTF-8" 
        outputencoding="UNICODE"/>
    </target>
  • Import the converted file via the DTS Import/Export Wizard. In Enterprise Manager right-click the database > All Tasks > Import Data...
  • Follow the prompts to import the converted file. On the file format screen select the following options: File Type = Unicode, Row Delimeter = {LF}, Text Qualifer = .
  • If all the names look neatly lined up on the next screen your file has imported okay.
  • In the next two steps select the destination database / table created in the first step.
  • Just to make sure all the UTF-16 content has come across okay - jump into Query Analyzer and run something like this:
    SELECT AlternateNames FROM GeoNames WHERE GeonameID=2147714

Labels:

Thanks, Russ.

I have set a link to this how-to in the geonames faq.

Marc

Can you upload your coverted allCountries.txt file for us to download? I am having problems with the Nant unicode conversion.

Thanks!

I could not get the NAnt file covnversion to work.

I found some simple command line tools that seemed to have done the trick.

http://www.geocities.com/j_ds_au/unicode.htm

I used the Unicode-UTF-8 ~ to ~ Unicode-UTF-16 Converter tool and I am importing the data right now.
-R

rob, yes nant is pretty fiddly. i have it on hand for automated builds, so I am pretty familiar with it. at the time any other method I found couldn't handle such a large file.

the utilities you've found defn do the trick - much easier.

can u plz upload converted file somewhere?

Bulk load: DataFileType was incorrectly specified as widechar. DataFileType will be assumed to be char because the data file does not have a Unicode signature.
Bulk load: DataFileType was incorrectly specified as widechar. DataFileType will be assumed to be char because the data file does not have a Unicode signature.

(8439966 row(s) affected)
mcupryk@shaw.ca

matt.

Anyone has a file without issues.
Please let me know.
thanks.

Post a Comment