julia-users › load a Julia dataframe from Microsoft SQL Server table
22 posts by 10 authors

Charles Brauer 3/17/15

Hi,

I'm considering diving into Julia. However, all of my data is in a Microsoft SQL Server database. I would really appreciate a Julia code example showing how to load a Julia dataframe from a SQL Server table.

Thanks
Charles

Jacob Quinn 3/17/15

Check out the https://github.com/quinnj/ODBC.jl package for connecting to a DSN defined in your ODBC manager.
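For the kind of example Charles asked for, here is a minimal sketch against the ODBC.jl API of that era (the DSN name "MSSQL" and the credentials are placeholders; `ODBC.connect`, `query`, and `disconnect` follow the package README of the time, and the API has changed across versions, so treat the exact names as assumptions):

```julia
using ODBC, DataFrames

# "MSSQL" is a hypothetical DSN defined in the system ODBC manager
# (e.g. pointing at the SQL Server Native Client driver);
# usr/pwd are placeholders.
co = ODBC.connect("MSSQL", usr="johndoe", pwd="secret")

# query() sends the SQL through the ODBC driver and, by default,
# materializes the result set as a DataFrame.
df = query("SELECT TOP 100 * FROM dbo.MyTable")

ODBC.disconnect(co)
```

Terry Seaward Feb 3

Is there a DBI-compliant version of this? Also, I have endless string conversion issues with this package.

Stefan Karpinski Feb 3

I've been working through encoding issues with this as well. To connect to Microsoft SQL Server, this patch made things work:

```
diff --git a/src/backend.jl b/src/backend.jl
index b5f24af..bf4ee11 100644
--- a/src/backend.jl
+++ b/src/backend.jl
@@ -40,7 +40,7 @@ end
 # Send query to DMBS
 function ODBCQueryExecute(stmt::Ptr{Void}, querystring::AbstractString)
-    if @FAILED SQLExecDirect(stmt, utf16(querystring))
+    if @FAILED SQLExecDirect(stmt, utf8(querystring))
         ODBCError(SQL_HANDLE_STMT,stmt)
         error("[ODBC]: SQLExecDirect failed; Return Code: $ret")
     end
```

The query string gets passed through to SQLExecDirect:

```julia
#SQLExecDirect
#http://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx
#Description: executes a preparable statement
#Status:
function SQLExecDirect(stmt::Ptr{Void}, query::AbstractString)
    @windows_only ret = ccall((:SQLExecDirect, odbc_dm), stdcall, Int16,
                              (Ptr{Void}, Ptr{UInt8}, Int),
                              stmt, query, sizeof(query))
    @unix_only ret = ccall((:SQLExecDirect, odbc_dm), Int16,
                           (Ptr{Void}, Ptr{UInt8}, Int),
                           stmt, query, sizeof(query))
    return ret
end
```

This function just converts whatever its argument is to a pointer. Looking at the docs for this function, the signature is this:

```
SQLRETURN SQLExecDirect(
     SQLHSTMT     StatementHandle,
     SQLCHAR *    StatementText,
     SQLINTEGER   TextLength);
```

and SQLCHAR is defined as unsigned char. So this would seem to be a non-wide character string – i.e. ASCII or UTF-8. And indeed, that's what the Microsoft SQL driver seems to be expecting. The question I have is this: how the heck is this working for other ODBC drivers? How are they getting pointers to UTF-16 data and interpreting it correctly? The correct fix would seem to be to make this always send UTF-8 strings. But when I made a PR that did that, it seemed to break other setups.

To make the mismatch concrete, here is a tiny 0.4-era experiment (utf8/utf16 are the same conversion functions used in the patch above) showing why handing UTF-16 data to an API declared as SQLCHAR * is suspect:

```julia
q = "SELECT 1"

# UTF-8: one byte per ASCII character, no zero bytes anywhere.
utf8(q).data
# 8-element Array{UInt8,1}: 0x53 0x45 0x4c 0x45 0x43 0x54 0x20 0x31

# The same query as UTF-16 code units, viewed byte by byte: every other
# byte is 0x00, so a driver that takes SQLCHAR* and treats the buffer as
# a NUL-terminated C string would stop reading right after the 'S'.
reinterpret(UInt8, utf16(q).data)
# 0x53 0x00 0x45 0x00 0x4c 0x00 ... (plus a trailing NUL pair)
```

Terry Seaward Feb 3

From R, it seems like the encoding is based on the connection (as opposed to being hard-coded). See `enc <- attr(channel, "encoding")` below:

```r
> RODBC::odbcQuery
function (channel, query, rows_at_time = attr(channel, "rows_at_time"))
{
    if (!odbcValidChannel(channel))
        stop("first argument is not an open RODBC channel")
    if (nchar(enc <- attr(channel, "encoding")))
        query <- iconv(query, to = enc)
    .Call(C_RODBCQuery, attr(channel, "handle_ptr"), as.character(query),
        as.integer(rows_at_time))
}
```

Digging down, `odbcConnect` is just a wrapper for `odbcDriverConnect`, which has the parameter `DBMSencoding = ""`. This calls the C function `C_RODBCDriverConnect` (available here: RODBC_1.3-12.tar.gz), which has no reference to encodings. So `attr(channel, "encoding")` is simply `DBMSencoding`, i.e. `""`. It seems to come down to `iconv(..., to = "")` which, from the R source code, uses the attached `win_iconv.c`. I can't seem to find how `""` is handled, i.e.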
is there some default value based on the system?

```r
> RODBC::odbcDriverConnect
function (connection = "", case = "nochange", believeNRows = TRUE,
    colQuote, tabQuote = colQuote, interpretDot = TRUE, DBMSencoding = "",
    rows_at_time = 100, readOnlyOptimize = FALSE)
{
    id <- as.integer(1 + runif(1, 0, 1e+05))
    stat <- .Call(C_RODBCDriverConnect, as.character(connection),
        id, as.integer(believeNRows), as.logical(readOnlyOptimize))
    if (stat < 0L) {
        warning("ODBC connection failed")
        return(stat)
    }
    Call <- match.call()
    res <- .Call(C_RODBCGetInfo, attr(stat, "handle_ptr"))
    isMySQL <- res[1L] == "MySQL"
    if (missing(colQuote))
        colQuote <- ifelse(isMySQL, "`", "\"")
    if (missing(case))
        case <- switch(res[1L], MySQL = "mysql", PostgreSQL = "postgresql",
            "nochange")
    switch(case, toupper = case <- 1L, tolower = case <- 2L,
        postgresql = case <- 2L, nochange = case <- 0L, msaccess = case <- 0L,
        mysql = case <- ifelse(.Platform$OS.type == "windows", 2L, 0L),
        stop("Invalid case parameter: nochange | toupper | tolower | common db names"))
    case <- switch(case + 1L, "nochange", "toupper", "tolower")
    rows_at_time <- max(1, min(1024, rows_at_time))
    cs <- attr(stat, "connection.string")
    if (grepl("PWD=", cs)) {
        attr(stat, "connection.string") <- sub("PWD=[^;]+($|;)",
            "PWD=******;", cs)
        Call$connection <- sub("PWD=[^;]+($|;)", "PWD=******;", connection)
    }
    structure(stat, class = "RODBC", case = case, id = id,
        believeNRows = believeNRows, colQuote = colQuote, tabQuote = tabQuote,
        interpretDot = interpretDot, encoding = DBMSencoding,
        rows_at_time = rows_at_time, isMySQL = isMySQL, call = Call)
}
```

Attachments (2): win_iconv.c (51 KB), iconv.h (1 KB)

Terry Seaward Feb 3

PS: thanks for the patch.

Tony Kelman Feb 3

Try to avoid posting snippets of GPL-licensed code if at all possible. That license is viral, and it's best not to run any risk of using code with that license in more-permissively-licensed Julia projects. Links are fine, as long as the license status of what you're linking to is clear.

Scott Jones Feb 4

> The question I have is this: how the heck is this working for other ODBC drivers? How are they getting pointers to UTF-16 data and interpreting it correctly? The correct fix would seem to be to make this always send UTF-8 strings. But when I made a PR that did that, it seemed to break other setups.

SQLCHAR is for encodings with 8-bit code units. It doesn't imply ASCII or UTF-8 (probably one of the more common character sets used with it is actually Microsoft's CP1252, which is often mistakenly described as ANSI Latin-1, of which it is a superset). Even when something says it is UTF-8, it frequently is not *really* valid UTF-8. For example, there are two common variations of UTF-8: CESU-8, used by MySQL and others, which encodes any non-BMP code point as its two UTF-16 surrogates, i.e. 6 bytes instead of the correct 4-byte UTF-8 sequence; and Java's Modified UTF-8, which is the same as CESU-8 except that embedded \0s are additionally encoded in a "long" form (0xc0 0x80), so that you can still use old-fashioned C \0 termination for strings while allowing embedded \0s. (Note: handling those formats, common in databases, was part of the reason I did the work to rewrite the `convert` functions for strings in Julia.)
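For instance, here is what CESU-8 does to a single non-BMP character, sketched in 0.4-era Julia (the helper below is illustrative, not from any package; it encodes the two UTF-16 surrogates as two 3-byte sequences, which is exactly what strict UTF-8 validation rejects):

```julia
# U+1F600 (a non-BMP emoji) in correct UTF-8: 4 bytes.
utf8("\U1f600").data
# 0xf0 0x9f 0x98 0x80

# Hypothetical helper: CESU-8 encodes each UTF-16 surrogate of a
# non-BMP code point as its own 3-byte UTF-8-style sequence (6 bytes).
function cesu8(c::Char)
    u = UInt32(c) - 0x10000
    hi = 0xd800 + (u >> 10)        # high surrogate
    lo = 0xdc00 + (u & 0x3ff)      # low surrogate
    units = UInt16[hi, lo]
    bytes = UInt8[]
    for s in units                 # 3-byte encoding of each surrogate
        push!(bytes, 0xe0 | (s >> 12))
        push!(bytes, 0x80 | ((s >> 6) & 0x3f))
        push!(bytes, 0x80 | (s & 0x3f))
    end
    bytes
end

cesu8('\U1f600')
# 0xed 0xa0 0xbd 0xed 0xb8 0x80  -- 6 bytes; invalid as strict UTF-8
```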
The reason your PR didn't work is that you need to find out from the database which character set it is using, and then do the conversions necessary to go back and forth between that character set and whatever Unicode type you are using in Julia (UTF8String, UTF16String, or UTF32String). One thing that could help is the package @nalimilan has been working on: https://github.com/nalimilan/StringEncodings.jl

You should also be able to use the SQLWCHAR API, but that can have its own issues; see the following from Oracle:

> NOTE: The mapping of the SQLWCHAR type is somewhat complicated and it can create hidden pitfalls for programmers porting their code from Windows to Linux. Usually a SQLWCHAR character is a 16-bit unit and we will not consider the exotic cases when SQLWCHAR is different. Windows uses UTF-16 and maps SQLWCHAR to the 16-bit wchar_t type. However, many Linux versions such as Ubuntu Linux use UTF-32 as an internal character set, and therefore their 32-bit wchar_t is not compatible with SQLWCHAR, which is always 16-bit.

Your working theory (https://github.com/JuliaDB/ODBC.jl/pull/71#issuecomment-175095266) is incorrect: it is an issue of the character encodings used, not of platforms. You see platform differences because the databases you are connecting to have different default character encodings on different platforms.
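As a minimal sketch of the round trip StringEncodings.jl enables (the encoding name "CP1252" and the byte values are illustrative; the names accepted are whatever the underlying iconv supports):

```julia
using StringEncodings

# Bytes as they might come back from a driver whose connection
# character set is CP1252: "café", with é as the single byte 0xe9.
raw = UInt8[0x63, 0x61, 0x66, 0xe9]

s = decode(raw, "CP1252")    # -> "café" as a proper Julia string

# Going the other way: re-encode a string into the connection's
# character set before handing it to the 8-bit (SQLCHAR*) API.
out = encode(s, "CP1252")    # -> the original four bytes
```

Páll Haraldsson Feb 4

On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
> SQLCHAR is for encodings with 8-bit code units. It doesn't imply ASCII or UTF-8 (probably one of the more common character sets used with it is actually Microsoft's CP1252, which is often mistakenly described as ANSI Latin-1, of which it is a superset).

When I read that, I thought that it must not be true: you can't have a superset (add letters) without dropping others (implying a superset of a subset). So I looked it up: https://en.wikipedia.org/wiki/Windows-1252

"differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters are curly quotation marks, the Euro sign, and all the printable characters that are in ISO 8859-15. [..] This is now standard behavior in the HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.[1] In January 2016 1.0% of all web sites use Windows-1252."

Still, despite this 1.0%, I think we should support this encoding (in some way, if not with its own 8-bit-only type; I'm not sure we need to support any other 8-bit one). It's no longer just some Microsoft thing, as I assumed, and it is ideal for most of Europe (and even the US/world, because of the "curly quotation" marks). I've been thinking of doing a string type that does the same as Python: encode in 8 bits when possible, possibly 7 bits (then it can still say it's UTF-8 and fast indexing is known; note that the strings are immutable). It wouldn't surprise me if "UTF-8" data would sometimes, incorrectly, include this as the "Latin-1" subset. I wonder whether this screws up sorting; it's not as if the exact position of the Euro sign is too important in alphabetical sorting. I could argue it should be sorted with E/e, but I assume just after A-Z a-z is OK for most. I had never heard of "control characters in the 80 to 9F (hex) range", so I assume it's a very obscure/ancient thing that can be taken to be never used anymore.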
> Even when something says it is UTF-8, it frequently is not *really* valid UTF-8. For example, there are two common variations of UTF-8: CESU-8, used by MySQL and others, which encodes any non-BMP code point as its two UTF-16 surrogates, i.e. 6 bytes instead of the correct 4-byte UTF-8 sequence; and Java's Modified UTF-8, which is the same as CESU-8 except that embedded \0s are encoded in a "long" form (0xc0 0x80)

Not only those: I thought the WTF-8 variant of UTF-8 (important for us, because of Windows filenames?) was a joke/vandalism at Wikipedia until I read it more closely: https://en.wikipedia.org/wiki/UTF-8#WTF-8

-- Palli.

Scott Jones Feb 4

I should have been clearer: it is a superset only of the printable characters; since it reuses assigned (even though pretty much never used) control-character positions, it is not truly a superset. ASCII is a 7-bit subset of ANSI Latin-1, which is an 8-bit subset of UCS-2, which is a 16-bit subset that can represent only the BMP, which is a subset of the Unicode code points (which need 21 bits).

I hadn't ever run across that in my work, but my work was in databases, usually Unix (AIX, Solaris, etc.) or Linux, not so much on Windows any longer, and I added Unicode support before surrogates even existed (they were added in Unicode 2.0 but not actually used until Unicode 3.0). I'm not sure what you'd want to do to convert that for use in Julia? (BTW, I think the initials of the "format" say it all!)

Milan Bouchet-Valat Feb 4

On Wednesday, 3 February 2016 at 11:44 -0800, Terry Seaward wrote:
> From R, it seems like the encoding is based on the connection (as
> opposed to being hard-coded). See `enc <- attr(channel, "encoding")`
> below: [...] I can't seem to find how `""` is handled, i.e. is there
> some default value based on the system?

`""` refers to the encoding of the current system locale. This is a reasonable guess, but it will probably be wrong in many cases (else, R wouldn't have provided this option at all).

Regards

Stefan Karpinski Feb 4

The real issue is this:

> SQLCHAR is for encodings with 8-bit code units.

Condescending lecture on encodings notwithstanding, UTF-16 is not such an encoding, yet UTF-16 is what the ODBC package is currently sending to SQLExecDirect for an argument of type SQLCHAR * – and somehow it seems to be working for many drivers, which still makes no sense to me. I can only conclude that some ODBC drivers treat this as a void * argument and expect pointers to data in whatever encoding they prefer, not specifically in encodings with 8-bit code units. Querying the database about what encoding it expects is a good idea, but how does one do that? The SQLGetInfo function seems like a good candidate, but this page doesn't include "encoding" or "utf" anywhere.
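For what introspection is available, SQLGetInfo can at least report the DBMS product name, from which a default encoding can be guessed; ODBC itself defines no standard "connection encoding" information type. A rough 0.4-era sketch of the call (assumptions: the unixODBC driver manager libodbc, an already-open connection handle `dbc`, and SQL_DBMS_NAME = 17 as defined in sqlext.h):

```julia
const SQL_DBMS_NAME = UInt16(17)   # InfoType constant from sqlext.h

# Ask the driver for the DBMS product name; dbc is an open SQLHDBC
# connection handle obtained from SQLConnect/SQLDriverConnect.
function dbms_name(dbc::Ptr{Void})
    buf = zeros(UInt8, 256)
    len = Ref{Int16}(0)
    ret = ccall((:SQLGetInfo, "libodbc"), Int16,
                (Ptr{Void}, UInt16, Ptr{UInt8}, Int16, Ptr{Int16}),
                dbc, SQL_DBMS_NAME, buf, Int16(length(buf)), len)
    # 0 = SQL_SUCCESS, 1 = SQL_SUCCESS_WITH_INFO
    ret in (0, 1) || error("SQLGetInfo failed; return code: $ret")
    bytestring(pointer(buf), len[])   # e.g. "Microsoft SQL Server"
end
```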
Stefan Karpinski Feb 4

Not a model of clarity (ANSI and Unicode are not encodings), but this page seems to be the best resource on this: https://msdn.microsoft.com/en-us/library/ms709439(v=vs.85).aspx

It seems that there's a parallel "Unicode" API for ODBC drivers that support it. Moreover:

> Currently, the only Unicode encoding that ODBC supports is UCS-2, which uses a 16-bit integer (fixed length) to represent a character. Unicode allows applications to work in different languages.

So using Klingon is off the table. Although the design of UTF-16 is such that sending UTF-16 to an application that expects UCS-2 will probably work reasonably well, as long as it treats it as "just data". This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when called with the non-Unicode API.

Scott Jones Feb 4

My reply wasn't meant to be condescending at all, just trying to explain the issue. UTF-16LE and UTF-16BE *are* encodings of the 16-bit UTF-16 encoding of Unicode onto 8-bit code units. If the server is sending UTF-16 or UTF-32, you should simply use the *W API, period, because in some places the 8-bit API can have problems with the embedded 0x00 bytes.

Scott Jones Feb 4

On Thursday, February 4, 2016 at 1:33:33 PM UTC-5, Stefan Karpinski wrote:
> Currently, the only Unicode encoding that ODBC supports is UCS-2 [...]

That comment is probably years (decades?) out of date. There is no limit of just UCS-2.

> This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when called with the non-Unicode API.

When you do so, are you actually calling the functions with the A suffix, or just the macro without either A or W? The macro will compile to either the A or the W form, depending on how your application is built. This is a better page in MSDN describing what is going on: https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx

Stefan Karpinski Feb 4

> When you do so, are you actually calling the functions with the A, or just the macro without either A or W?

The ODBC package calls the functions without A or W. What it's calling can't be a macro, since macros aren't callable via ccall. But changing ODBC to call the W version of everything may be the fix here.

Scott Jones Feb 4

That very well may be the solution: looking, for example, at libiodbc on the Mac, it has three different versions of all those functions, and I'm not sure just what behavior you get when using the form without the A or W.
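For concreteness, here is roughly what the W entry point would look like from Julia, as a hedged 0.4-era sketch (unix spelling; libodbc is the unixODBC driver manager, on Windows it would be odbc32 with stdcall, and SQL_NTS is the standard "null-terminated string" length sentinel from sql.h):

```julia
const SQL_NTS = Int32(-3)   # length sentinel from sql.h: "null-terminated"

# Wide-character entry point: StatementText is SQLWCHAR* (16-bit code
# units on every platform) and TextLength is counted in code units,
# not bytes; passing SQL_NTS avoids having to count at all.
function SQLExecDirectW(stmt::Ptr{Void}, query::AbstractString)
    w = utf16(query)      # 0.4-era: w.data is NUL-terminated UTF-16
    ccall((:SQLExecDirectW, "libodbc"), Int16,
          (Ptr{Void}, Ptr{UInt16}, Int32),
          stmt, w.data, SQL_NTS)
end
```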
I've always used ODBC with the C headers, unlike the direct linking that Julia is doing, so that it always gets the W version since I compile as a Unicode build.

Jacob Quinn Feb 4

That's a big part of the "remodel" I've been working on: always call the "W" version of the functions and use UTF-8 consistently (see here: https://github.com/JuliaDB/ODBC.jl/blob/jq/remodel/src/API.jl). I would certainly welcome those willing to test various configurations/setups.

-Jacob

David Anthoff Feb 5

https://msdn.microsoft.com/en-us/library/ms716246%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396 suggests that if you call the version without the A or W suffix, you get the ANSI version.

Stefan Karpinski Feb 5

It does, but that's not what we're seeing – at least with some ODBC drivers.

Jameson Feb 9

Calling the ANSI version doesn't preclude the possibility of getting UTF-16 data back. In particular, that would be code page 1200 (UTF-16LE) or 1201 (UTF-16BE) on Microsoft Windows. MSDN is inconsistent in its usage of A and whether it means ANSI, OEM, localized locale, or application-dependent (aka other), and it generally makes no statement about how the bytes may need to be handled or interpreted.

Stefan Karpinski Feb 9

So that's basically a useless API then? I guess we should use the W version everywhere.
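Jameson's point in miniature: the same returned buffer means nothing until you know which code page the driver used. A toy illustration with StringEncodings.jl (the encoding names are whatever iconv accepts; code pages 1200/1201 correspond to "UTF-16LE"/"UTF-16BE"):

```julia
using StringEncodings

# Four bytes handed back by some hypothetical driver.
buf = UInt8[0x41, 0x00, 0x42, 0x00]

decode(buf, "UTF-16LE")   # "AB"      -- if the driver meant code page 1200
decode(buf, "CP1252")     # "A\0B\0"  -- if it really was an 8-bit ANSI page
```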