
Joining datatables to create a new datatable


Problem

I am joining datatables to create a new datatable:

var row = from r0w1 in dt_vi.AsEnumerable()
          join r0w2 in dt_w.AsEnumerable()
            on r0w1.Field<int>("ID") equals r0w2.Field<int>("iD")
          join r0w3 in dt_re.AsEnumerable()
            on r0w1.Field<int?>("ID") equals r0w3.Field<int?>("id")
          join r0w4 in dt_def.AsEnumerable()
            on r0w1.Field<int?>("ID") equals r0w4.Field<int?>("id") into ps
          from r0w4 in ps.DefaultIfEmpty()
          select r0w1.ItemArray.Concat(r0w2.ItemArray.Concat(r0w3.ItemArray.Concat(r0w4 != null ? r0w4.ItemArray : new object[] { }))).ToArray();

foreach (object[] values in row)
    dt.Rows.Add(values);


In the above code,

foreach (object[] values in row)
    dt.Rows.Add(values);


is slow for lakhs (hundreds of thousands) of rows. I want to put the data from row into the dt DataTable. Is there a faster way of doing this?
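As an aside (my addition, not benchmarked in the answer below): part of the `Rows.Add` cost is per-row validation, index maintenance and change notifications. `DataTable.BeginLoadData` / `EndLoadData` suspend those during a bulk load. A minimal sketch, with a small stand-in for the joined `row` sequence:

```csharp
using System;
using System.Data;

class BulkLoadSketch
{
    static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("Id", typeof(int));
        dt.Columns.Add("Name", typeof(string));

        // Stand-in for the object[] sequence produced by the joins.
        var rows = new object[][]
        {
            new object[] { 1, "a" },
            new object[] { 2, "b" },
        };

        dt.BeginLoadData();          // suspend constraints, indexes, notifications
        try
        {
            foreach (object[] values in rows)
                dt.Rows.Add(values);
        }
        finally
        {
            dt.EndLoadData();        // re-enable and re-validate
        }

        Console.WriteLine(dt.Rows.Count); // 2
    }
}
```

Whether this helps noticeably depends on how many constraints and indexes the target table carries.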

Solution

I have managed to run some tests (I was very curious about LINQ on objects vs. SQL processing performance):

1) Setup

-- drop table t1
create table t1 (id INT PRIMARY KEY CLUSTERED, name NVARCHAR(1000))
go
-- drop table t2
create table t2 (id INT PRIMARY KEY CLUSTERED, name NVARCHAR(1000))
go
-- drop table t3
create table t3 (id INT PRIMARY KEY CLUSTERED, name NVARCHAR(1000))
go
-- drop table t4
create table t4 (id INT PRIMARY KEY CLUSTERED, name NVARCHAR(1000))
go

insert into t1 select top 200000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)), CAST(text AS NVARCHAR(1000)) from sys.messages
insert into t2 select top 200000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)), CAST(text AS NVARCHAR(1000)) from sys.messages where message_id > 1000
insert into t3 select top 200000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)), CAST(text AS NVARCHAR(1000)) from sys.messages where message_id > 2000
insert into t4 select top 200000 ROW_NUMBER() OVER (ORDER BY (SELECT 1)), CAST(text AS NVARCHAR(1000)) from sys.messages where message_id > 3000


All tables have 2 lakhs (200,000) rows.

The application and SQL Server run on the same machine (PC).

2) C# code

public static DataTable fillDataTable(SqlConnection sqlConn, string table)
        {
            return fillDataTableByQuery(sqlConn, "SELECT * FROM " + table);
        }

        public static DataTable fillDataTableByQuery(SqlConnection sqlConn, String query)
        {
            DataTable dt = new DataTable();
            using (SqlCommand cmd = new SqlCommand(query, sqlConn))
            {
                SqlDataAdapter da = new SqlDataAdapter(cmd);

                da.Fill(dt);
            }

            return dt;
        }

            var dt = new DataTable();
            for (int i = 0; i < 4; i++)
            {
                dt.Columns.Add(String.Format("Id{0}", i), typeof(int));
                dt.Columns.Add(String.Format("Name{0}", i), typeof(String));
            }

            using (SqlConnection con = new SqlConnection(connectionString))
            {
                con.Open();

                var sw = new Stopwatch();
                sw.Start();

                var dt1 = fillDataTable(con, "t1");
                var dt2 = fillDataTable(con, "t2");
                var dt3 = fillDataTable(con, "t3");
                var dt4 = fillDataTable(con, "t4");
                Console.WriteLine("Loaded in " + sw.ElapsedMilliseconds);
                sw.Reset(); sw.Start();

                var row = from r0w1 in dt1.AsEnumerable().AsParallel()
                          join r0w2 in dt2.AsEnumerable().AsParallel()
                            on r0w1.Field<int>("ID") equals r0w2.Field<int>("iD")
                          join r0w3 in dt3.AsEnumerable().AsParallel()
                            on r0w1.Field<int?>("ID") equals r0w3.Field<int?>("id")
                          join r0w4 in dt4.AsEnumerable().AsParallel()
                            on r0w1.Field<int?>("ID") equals r0w4.Field<int?>("id") into ps
                            from r0w4 in ps.DefaultIfEmpty()
                            select r0w1.ItemArray.Concat(r0w2.ItemArray.Concat(r0w3.ItemArray.Concat(r0w4 != null ? r0w4.ItemArray : new object[] { }))).ToArray();

                foreach (object[] values in row)
                    dt.Rows.Add(values);

                Console.WriteLine("Processed in " + sw.ElapsedMilliseconds);
                sw.Reset(); sw.Start();

                var directDt = fillDataTableByQuery(con, @"
select t1.*, t2.*, t3.*, t4.*
from t1
    join t2 on t2.id = t1.id
    join t3 on t3.id = t2.id
    left join t4 on t4.id = t3.id
");

                Console.WriteLine("Sql process + fetch in " + sw.ElapsedMilliseconds);


3) Results

No parallel

Loaded in 4370
Processed in 3249
Sql process + fetch in 3033


With parallel

Loaded in 4240
Processed in 1904
Sql process + fetch in 3160


So, including fetch, SQL performs better in this case (IDs are all clustered indexes). Also, using AsParallel makes a difference, at least on my machine (eight-core).
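One caveat worth noting (my addition, not part of the measured runs): PLINQ does not preserve source ordering by default, so with `AsParallel()` the rows land in `dt` in a nondeterministic order. If order matters, `AsOrdered()` on the outer source restores it, at the cost of some of the speedup. A small self-contained sketch:

```csharp
using System;
using System.Data;
using System.Linq;

class OrderedJoinSketch
{
    static void Main()
    {
        // Two small in-memory tables standing in for dt1/dt2 from the benchmark.
        var dt1 = new DataTable();
        dt1.Columns.Add("ID", typeof(int));
        var dt2 = new DataTable();
        dt2.Columns.Add("ID", typeof(int));
        for (int i = 0; i < 1000; i++) { dt1.Rows.Add(i); dt2.Rows.Add(i); }

        // AsOrdered() on the outer (left) source makes PLINQ emit join
        // results in source order, trading away part of the parallel gain.
        var ids = (from r1 in dt1.AsEnumerable().AsParallel().AsOrdered()
                   join r2 in dt2.AsEnumerable().AsParallel()
                     on r1.Field<int>("ID") equals r2.Field<int>("ID")
                   select r1.Field<int>("ID")).ToArray();

        Console.WriteLine(ids.SequenceEqual(Enumerable.Range(0, 1000))); // True
    }
}
```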

[later edit] - Dictionary version

Since you have to target .NET Framework 3.5 (VS 2008), and you said that you load the data once and then do more processing, I think you can try the following:

class AggrData
    {
        public int Id1 { get; set; }
        public String Name1 { get; set; }
        public int Id2 { get; set; }
        public String Name2 { get; set; }
        public int Id3 { get; set; }
        public String Name3 { get; set; }
        public int Id4 { get; set; }
        public String Name4 { get; set; }
    }

    // value is defined as anonymous type for extensibility
    var dict1 = dt1.AsEnumerable().ToDictionary(r => r.Field<int>("ID"), r => new { Name = r.Field<String>("Name") });
    var dict2 = dt2.AsEnumerable().ToDictionary(r => r.Field<int>("ID"), r => new { Name = r.Field<String>("Name") });
    var dict3 = dt3.AsEnumerable().ToDictionary(r => r.Field<int>("ID"), r => new { Name = r.Field<String>("Name") });
    var dict4 = dt4.AsEnumerable().ToDictionary(r => r.Field<int>("ID"), r => new { Name = r.Field<String>("Name") });

    Console.WriteLine("Dictionary setup took " + sw.ElapsedMilliseconds);
    sw.Reset(); sw.Start();

    var aggrList = new List<AggrData>();
    foreach (var key in dict1.Keys)
    {
        if (!dict2.ContainsKey(key) || !dict3.ContainsKey(key))
            continue;

        var newAggr = new AggrData() 
            {   Id1 = key, Name1 = dict1[key].Name, 
                Id2 = key, Name2 = dict2[key].Name,
                Id3 = key, Name3 = dict3[key].Name
            };
        if (dict4.ContainsKey(key))
        {
            newAggr.Id4 = key;
            newAggr.Name4 = dict4[key].Name;
        }

        aggrList.Add(newAggr);
    }

    Console.WriteLine("Aggregate POCOs fill took " + sw.ElapsedMilliseconds);
    sw.Reset(); sw.Start();


Results:

Dictionary setup took 3654
Aggregate POCOs fill took 202


So, the setup is expensive, but once the dictionaries are built each aggregation pass is very fast (202 ms here versus seconds for the LINQ joins), which pays off when you load the data once and process it many times.
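If the result is still needed as a DataTable (as in the original question), the aggregated POCOs can be copied over in one final pass. This is my sketch, not part of the measured answer, reusing the `AggrData` shape from above:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

class AggrData
{
    public int Id1 { get; set; }
    public String Name1 { get; set; }
    public int Id2 { get; set; }
    public String Name2 { get; set; }
    public int Id3 { get; set; }
    public String Name3 { get; set; }
    public int Id4 { get; set; }
    public String Name4 { get; set; }
}

class PocoToDataTableSketch
{
    static void Main()
    {
        var aggrList = new List<AggrData>
        {
            new AggrData { Id1 = 1, Name1 = "a", Id2 = 1, Name2 = "b",
                           Id3 = 1, Name3 = "c", Id4 = 1, Name4 = "d" }
        };

        var dt = new DataTable();
        for (int i = 1; i <= 4; i++)
        {
            dt.Columns.Add("Id" + i, typeof(int));
            dt.Columns.Add("Name" + i, typeof(String));
        }

        dt.BeginLoadData();   // skip per-row validation while bulk-adding
        foreach (var a in aggrList)
            dt.Rows.Add(a.Id1, a.Name1, a.Id2, a.Name2,
                        a.Id3, a.Name3, a.Id4, a.Name4);
        dt.EndLoadData();

        Console.WriteLine(dt.Rows.Count); // 1
    }
}
```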

Context

StackExchange Code Review Q#114368, answer score: 3
