Importing the Geo-Names database into MsSql Server
|
The article you are reading has moved! It is now available at: http://blog.tinisles.com/2006/08/importing-the-geo-names-database-into-mssql-server/ |
- UNIX2DOS to update the line endings of AllCountries.txt
- Same nant script below to convert AllCountries.txt to UNICODE
- CREATE TABLE now looks like this:
CREATE TABLE [GeoNames] ( [GeonameID] [int] NOT NULL , [Name] [nvarchar] (200) NOT NULL , [AnsiName] [varchar] (200) NULL , [AlternateNames] [nvarchar] (MAX) NULL , [Latitude] [float] NOT NULL , [Longitude] [float] NOT NULL , [FeatureClass] [char] (1) NULL , [FeatureCode] [varchar] (10) NULL , [CountryCode] [char] (2) NULL , [CC2] [varchar] (60) NULL , [Admin1Code] [varchar] (20) NULL , [Admin2Code] [varchar] (80) NULL , [Admin3Code] [varchar] (20) NULL , [Admin4Code] [varchar] (20) NULL , [Population] [bigint] NOT NULL , [Elevation] [int] NULL , [GTopo30] [int] NULL , [Timezone] [varchar] (50) NULL , [ModificationDate] [datetime] NULL )
- EdKatibah's BULK INSERT works nicely:
BULK INSERT GeoNames FROM 'D:\Developer\Geonames\allCountries_unicode.txt' WITH( DATAFILETYPE = 'widechar', FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n' ) GO
Geo-Names "provides free geo-data such as geographical names and postal codes. The database contains over 6 million entries for geographical names whereof 2.2 million cities and villages". The offer their content via a webservice, and as a database download.
I'm using a subset of this data for my 'learn ASP.NET 2.0' project, so I've imported the database export into my SQL Server. Here's the steps I followed:
- Create a table that looks like this (note the nvarchar columns for the extendend characters):
CREATE TABLE [CityNames] ( [GeonameID] [int] NOT NULL , [Name] [nvarchar] (200) NOT NULL , [AnsiName] [varchar] (200) NOT NULL , [Latitude] [decimal](18, 15) NOT NULL , [Longitude] [decimal](18, 15) NOT NULL , [FeatureCode] [varchar] (10) NULL , [CountryCode] [char] (2) NULL , [Admin1Code] [varchar] (20) NULL , [Population] [int] NOT NULL )
- The database export is a UTF8 file - MSSQL DTS import wants a UTF16 file. Easy to convert a file by opening it in Textpad/VS.Net and using 'Save As' to change the encoding. Not so easy with the 500meg 'allCountries' file. The nant copy task works well for this:
<target name="build"> <copy file="allCountries.txt" todir="Converted" inputencoding="UTF-8" outputencoding="UNICODE"/> </target>
- Import the converted file via the DTS Import/Export Wizard. In Enterprise Manager right-click the database > All Tasks > Import Data...
- Follow the prompts to import the converted file. On the file format screen select the following options: File Type = Unicode, Row Delimeter = {LF}, Text Qualifer =
. - If all the names look neatly lined up on the next screen your file has imported okay.
- In the next two steps select the destination database / table created in the first step.
- Just to make sure all the UTF-16 content has come across okay - jump into Query Analyzer and run something like this:
SELECT AlternateNames FROM GeoNames WHERE GeonameID=2147714
Labels: mssql
Thanks, Russ.
I have set a link to this how-to in the geonames faq.
Marc
Posted by Anonymous | 5:13 pm
Can you upload your coverted allCountries.txt file for us to download? I am having problems with the Nant unicode conversion.
Thanks!
Posted by Anonymous | 5:47 pm
I could not get the NAnt file covnversion to work.
I found some simple command line tools that seemed to have done the trick.
http://www.geocities.com/j_ds_au/unicode.htm
I used the Unicode-UTF-8 ~ to ~ Unicode-UTF-16 Converter tool and I am importing the data right now.
-R
Posted by Bobby C | 5:36 am
rob, yes nant is pretty fiddly. i have it on hand for automated builds, so I am pretty familiar with it. at the time any other method I found couldn't handle such a large file.
the utilities you've found defn do the trick - much easier.
Posted by Russ | 11:05 am
can u plz upload converted file somewhere?
Posted by Anonymous | 5:09 am
Bulk load: DataFileType was incorrectly specified as widechar. DataFileType will be assumed to be char because the data file does not have a Unicode signature.
Bulk load: DataFileType was incorrectly specified as widechar. DataFileType will be assumed to be char because the data file does not have a Unicode signature.
(8439966 row(s) affected)
mcupryk@shaw.ca
matt.
Anyone has a file without issues.
Please let me know.
thanks.
Posted by Anonymous | 3:46 am