Building a Data Warehouse of Call Data Records (CDR): Solution Considerations
  
	
  
With many organisations having established data warehouse solutions, there comes a point where the solution needs to be extended in an unusual direction.
  
This challenge was given to us by one of our clients, a European Mobile Virtual Network Operator (MVNO), who wanted to add both compliance and archiving functionality to their existing data warehouse.
  	
  
	
  
If you don't know, an MVNO is a mobile phone operator that directly services its own customers but does not own the network assets, instead leasing these from a mobile (non-virtual) network operator. The company's profits are derived from the margin between the price a subscriber pays to the MVNO and the wholesale cost of purchasing capacity from the underlying host operator.
  
	
  
MVNOs differentiate themselves by very competitive pricing on certain parts of the call package; for example, ethnic MVNOs target ethnic communities by providing inexpensive calls to their home country. Enabling such offerings and tracking costs, especially where subscribers start using the service in unexpected and possibly costly ways, is a key business function, and this is where the data warehouse is used.
  
	
  
The primary role of the data warehouse is to profile subscribers with their associated attributes and their call usage patterns. Our client had this built in a database that provided them with all of the subscriber information plus three months of call detail records (CDRs) and data detail records (IPDRs). Beyond the three-month limit there was aggregate data about calls but no direct access to the original call records.
  
	
  
This caused the business a number of challenges:
  
	
  
1. To provide a cost-effective, queryable archive of CDRs/IPDRs.

   The original solution archived the incoming CDR files to disk and subsequently to CD. This meant that if a specific set of records was required, the files had to be recovered, re-loaded and then queried.
  
              	
  
2. To enable compliance with data retention law.

   EU law (amongst others) requires Telcos to retain call and data detail records for a period of between 6 and 24 months. Whilst the MVNO was in compliance with the regulations, much of the data was stored on CD and so suffered from the same problems as the archive data.
  
              	
  
3. Enhanced CDR analysis in the existing data warehouse.

   The business was looking to analyse more of the CDR data on an ad hoc basis without the need to store more data in the main data warehouse. They also wanted to be able to change the stored aggregates quickly and easily when new data patterns were found. Previously this had again required the recovery and re-loading of archived data.
  
   	
  
The objective of the project was to design a solution that allowed CDRs to be stored for immediate access for at least two years and that could easily be extended to support up to seven years of data. As a consequence it was also envisaged that the solution would improve the performance of aggregate load and rebuild, when required, because the CDRs would be available online.
  	
  
   	
  
Solution Considerations
  
   	
  
Given the business requirements, a number of technical aspects were then considered. These can be outlined as follows:
  
   	
  
1) Should Hadoop or NoSQL solutions be used?
  
   	
  
Hadoop and other NoSQL solutions have grabbed a significant number of column inches over the past couple of years, and in the right circumstances they offer a very cost-effective and high-performance solution. The discussion about the use of a Hadoop or NoSQL solution centred on the following aspects:
  
   	
  
a) A strength of Hadoop and NoSQL solutions is in allowing easy mapping of unstructured data. The data files in use had a strong pre-defined structure, and importantly any change to this structure would signify a material change across a number of business systems. There was no unstructured data that we wished to analyse.
  
            	
  
b) There were no existing skills within the IT organisation. This in itself should not be a blocker to technological innovation within a company. However, this particular project was about maintaining business as usual and could not be used as a vehicle for internal training.
  
            	
  
c) NoSQL means exactly that: not only does the IT department have no NoSQL skills, but the user community has a range of SQL tools. If the NoSQL route were selected, any data extraction would fall back onto the IT department rather than being self-service for business users.
  
            	
  
d) The business has a range of Business Intelligence tools, such as Business Objects, and these require an SQL database. It was noted that at that point in time there were no significant BI tools that worked with NoSQL solutions.
  
            	
  
e) Internal perception. Some companies choose to be at the vanguard or bleeding edge of technology; others are the last to change or upgrade. This company favours the middle ground of choosing established technologies that offer a route to the future.
  
            	
  
2) If it is a database, then what type of database?
  
   	
  
Once the Hadoop debate had been put to rest, the discussion turned to what type of database to use. Three options were considered.
  	
  
	
  
a) Row Store Databases

A row store database is the type of database that most people will be familiar with. Each record is stored as a row in a table and queried using SQL. The organisation has a number of different row store database platforms for its operational systems; however, the existing data warehouse is based on Sybase ASE, so this would be the natural choice. The primary benefit would be the continued use of existing technology; however, as a large database it would not necessarily be performant, and it would require significantly more disk space than the current raw files, as both data and index space would be needed.
  
     	
  
     	
  
b) MPP-based Data Warehouse Appliances

There are a number of massively parallel processing (MPP) architectures. These also store the data in rows; however, they apply many CPUs, with data on disks close to each CPU, to speed up the problem. Whilst these can be very attractive solutions, they posed two specific problems for the project.

Firstly, it would not be cost-effective to retain the existing data warehouse on the current platform and create an archive store on an MPP solution, because MPP solutions generally match CPU and disk together for performance. Therefore, to get high performance there would be a significant under-utilisation of the available disk resource.

Secondly (and as a consequence of the first issue), if the project were to move the existing data warehouse to an MPP platform, it would require porting or rewriting of the load processes (also known as ETL), which would add cost and time. This was exacerbated by the use of platform-specific user-defined functions (UDFs) in the current data warehouse that are not available on the MPP platforms.
  
     	
  
c) Column-Store Databases

The third option was a column store database. To the user these look identical to a row-based database, but behind the scenes the data is actually stored in its columns, with a record being made up of pointers to the data held in the columns.

Column store databases consequently store only one copy of each value in a column. If you have a column in a table that contains the calling telephone number, then each telephone number will be repeated many times. Storing it only once and using a pointer to the string considerably reduces space, and because less space is used there is less I/O required to retrieve the data from disk, which also speeds up query time. Both a smaller disk footprint and faster query times make column store databases very attractive.

Column store databases are considered more difficult to update than traditional row-based databases, because an update has to check the data both in the record map and in the column store itself. This is not an issue for data warehousing and archiving solutions, where the data is usually loaded once and not updated.

Despite a lack of general awareness of column store databases, they have been around since the mid-1970s, and Wikipedia currently lists 21 commercial options and 8 open source solutions. The most established column store database in the market is Sybase IQ, which fitted nicely with the existing data warehouse on Sybase ASE.
  
	
  
Based on these considerations the client chose to go with a column-based database. The next stage of the project was to build a prototype Sybase IQ store to understand what sort of performance and compression could be obtained.
  	
  
	
  
Solution Design
  
	
  
The solution design was to create a simple schema with two types of tables, one for the CDR data and one for the IPDR data. It was also decided that the tables would be split into multiple datasets based on the start date of the call or data session. As a result we ended up with a number of tables of the form CDR_2012, CDR_2011, etc. and IPDR_2012, IPDR_2011, etc.
  	
  
	
  
Views were created over these tables: a view called CDR over the CDR_XXXX tables and a view called IPDR over the IPDR_XXXX tables. A further view, called DATA_RECORDS, was created across all CDR_XXXX and IPDR_XXXX tables, selecting only the fields that were common to both (start time, cost, record type, etc.). This was done for future ease of management, as it would be possible to remove an entire year of data simply by rebuilding the view to exclude the year and then dropping the table, as sketched below.
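As an illustration, a minimal sketch of the pattern in generic SQL. The column names (CALL_START, CALLING_NO, COST, RECORD_TYPE, SESSION_START) are hypothetical stand-ins for the real 80-field record layout:

    -- One physical table per year of CDRs; the IPDR_XXXX tables follow the same pattern.
    CREATE TABLE CDR_2011 (
        CALL_START  DATETIME      NOT NULL,   -- start date/time of the call
        CALLING_NO  VARCHAR(20)   NOT NULL,   -- calling telephone number
        COST        DECIMAL(10,4) NOT NULL,   -- cost of the call
        RECORD_TYPE CHAR(1)       NOT NULL    -- record type code
        -- ... remaining fields of the record ...
    );
    CREATE TABLE CDR_2012 (
        CALL_START  DATETIME      NOT NULL,
        CALLING_NO  VARCHAR(20)   NOT NULL,
        COST        DECIMAL(10,4) NOT NULL,
        RECORD_TYPE CHAR(1)       NOT NULL
    );
    CREATE TABLE IPDR_2011 (
        SESSION_START DATETIME      NOT NULL, -- start date/time of the data session
        COST          DECIMAL(10,4) NOT NULL,
        RECORD_TYPE   CHAR(1)       NOT NULL
    );
    CREATE TABLE IPDR_2012 (
        SESSION_START DATETIME      NOT NULL,
        COST          DECIMAL(10,4) NOT NULL,
        RECORD_TYPE   CHAR(1)       NOT NULL
    );

    -- The CDR view unions the yearly tables into one queryable archive
    -- (the IPDR view is built the same way over the IPDR_XXXX tables).
    CREATE VIEW CDR AS
        SELECT * FROM CDR_2011
        UNION ALL
        SELECT * FROM CDR_2012;

    -- DATA_RECORDS exposes only the fields common to both record types.
    CREATE VIEW DATA_RECORDS AS
        SELECT CALL_START AS START_TIME, COST, RECORD_TYPE FROM CDR_2011
        UNION ALL
        SELECT CALL_START, COST, RECORD_TYPE FROM CDR_2012
        UNION ALL
        SELECT SESSION_START, COST, RECORD_TYPE FROM IPDR_2011
        UNION ALL
        SELECT SESSION_START, COST, RECORD_TYPE FROM IPDR_2012;

    -- Retiring a year: re-create the views without it, then drop the table.
    -- (DATA_RECORDS would be rebuilt in the same way before the drop.)
    DROP VIEW CDR;
    CREATE VIEW CDR AS SELECT * FROM CDR_2012;
    DROP TABLE CDR_2011;

Users only ever query the CDR, IPDR and DATA_RECORDS views, so the yearly partitioning underneath remains invisible to them.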
  
	
  
One year of call records consisted of around 500GB of raw data and 1 billion rows of CDRs. When loaded into the database this took up less than 100GB (a compression ratio of 5.7:1). Similar results were obtained for the IPDRs. Therefore a 2TB system (4TB when mirrored) could effectively hold 7 years of history even allowing for predicted growth (at under 100GB per year of compressed CDRs, seven years of call records alone would need less than 700GB).
  
	
  
The testing split the performance into two aspects: the load time and the query performance.
  	
  
	
  
The CDRs are delivered from the host operator on an hourly schedule as fixed-format flat ASCII files, each typically containing around 35,000 records, where each record is 583 characters long and contains 80 fields. Loading an average file (typically 35K records in a 2MB file) took around 1 second, and since there are around 200 files per day this was not a burden on the system. It even allowed data to be loaded throughout the day, rather than in a single nightly batch, giving more immediate access to the CDRs.
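A minimal sketch of the kind of load statement involved, assuming Sybase IQ's LOAD TABLE with fixed-width ASCII(n) column specifications; the file path, field names and field widths are hypothetical:

    -- Bulk-load one hourly fixed-format file into the current year's table.
    -- ASCII(n) carves each 583-character record into fixed-width fields.
    LOAD TABLE CDR_2012 (
        CALL_START  ASCII(14),   -- e.g. YYYYMMDDHHMMSS
        CALLING_NO  ASCII(20),
        COST        ASCII(10),
        RECORD_TYPE ASCII(1)
        -- ... remaining fixed-width fields, 583 characters in total ...
    )
    FROM '/data/cdr/incoming/cdr_2012031412.dat'
    QUOTES OFF
    ESCAPES OFF
    NOTIFY 100000;   -- progress message every 100,000 rows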
  
	
  
Initially two years of data were loaded, although this will be allowed to grow over the coming years. User query times directly on the Sybase IQ database varied dramatically depending on the type and complexity of the query, but were typically 10 to 15 times faster than the equivalent Sybase ASE queries. In addition, the queries were working on a 24-month data set rather than the original 3-month data set.
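A representative ad hoc query of the kind run against the archive, using the CDR view and the same hypothetical column names as above:

    -- Monthly call volume and cost for one subscriber across the whole archive.
    SELECT DATEPART(yy, CALL_START) AS call_year,
           DATEPART(mm, CALL_START) AS call_month,
           COUNT(*)                 AS calls,
           SUM(COST)                AS total_cost
    FROM   CDR
    WHERE  CALLING_NO = '441234567890'
    GROUP BY DATEPART(yy, CALL_START), DATEPART(mm, CALL_START)
    ORDER BY call_year, call_month;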
  
	
  
The existing data warehouse was left in situ, and using Sybase ASE together with Sybase IQ permitted the users to access both data sets from a single unified interface. The aggregate load, held in the existing Sybase ASE data warehouse, was also modified to build from the Sybase IQ store rather than from the raw files; this reduced load complexity and allowed the IT team to be more responsive in delivering new aggregates.
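By way of illustration, a minimal sketch of an aggregate build sourced from the online IQ store rather than raw files; the aggregate table AGG_DAILY_USAGE and its columns are hypothetical:

    -- Rebuild a daily usage aggregate directly from the online CDR archive;
    -- re-running with a different GROUP BY is all that is needed when a new
    -- data pattern is found, with no recovery or re-load of archived files.
    TRUNCATE TABLE AGG_DAILY_USAGE;

    INSERT INTO AGG_DAILY_USAGE (call_date, record_type, calls, total_cost)
    SELECT CAST(CALL_START AS DATE),
           RECORD_TYPE,
           COUNT(*),
           SUM(COST)
    FROM   CDR
    GROUP  BY CAST(CALL_START AS DATE), RECORD_TYPE;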
  
	
  
In summary, a CDR/IPDR archive was created on a second server with directly attached storage, roughly equivalent in size and performance to the original data warehouse platform. The archive was accessible via SQL, allowed users access from within the current environment, compressed the data by more than five times and was typically ten times faster to query.
  	
  
	
  
	
  
	
  
About the author
  
	
  
David Walker has been involved with Business Intelligence and Data Warehousing for over 20 years, first as a user, then with a software house and finally with a hardware vendor, before setting up his own consultancy firm, Data Management & Warehousing (http://www.datamgmt.com), in 1995.
  
	
  
David and his team have worked around the world on projects designed to deliver the maximum benefit to the business by converting data into information and finding innovative ways in which to help businesses exploit that information.
  
	
  
David's project work has given him wide industry experience, including Telcos, Finance, Retail, Manufacturing, Transport and the Public Sector, as well as a broad and deep knowledge of Business Intelligence and Data Warehousing technologies.
  
	
  
David has written extensively on all aspects of data warehousing, from technical architecture and data model design, through ETL and data quality, to project management and governance.
  
	
  

 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Building a data warehouse of call data records

and so suffered from the same problem as the archive data.

3. Enhanced CDR analysis in the existing data warehouse

The business wanted to analyse more of the CDR data on an ad hoc basis, without the need to store more data in the main data warehouse. They also wanted to be able to change the stored aggregates
quickly and easily when new data patterns were found. Previously this had again required the recovery and re-loading of archived data.

The objective of the project was to design a solution that allowed CDRs to be stored for immediate access for at least two years, and that could easily be extended to support up to seven years of data. It was also envisaged that the solution would improve the performance of aggregate loads and rebuilds, because the CDRs would be available online.

Solution Considerations

Given the business requirements, a number of technical aspects were then considered. These can be outlined as follows:

1) Should Hadoop or NoSQL solutions be used?

Hadoop and other NoSQL solutions have grabbed a significant number of column inches over the past couple of years, and in the right circumstances they offer a very cost effective and high performance solution. The discussion about the use of a Hadoop or NoSQL solution centred on the following aspects:

a) A strength of Hadoop and NoSQL solutions is in allowing easy mapping of unstructured data. The data files in use had a strong pre-defined structure, and importantly any change to this structure would signify a material change across a number of business systems. There was no unstructured data that we wished to analyse.

b) There were no existing skills within the IT organisation. This in itself should not be a blocker to technological innovation within a company. However, this particular project was about maintaining business as usual and could not be used as a vehicle for internal training.

c) NoSQL means exactly that: not only did the IT department have no NoSQL skills, but the user community had a range of SQL tools. If the NoSQL route were selected, any data extraction would fall back onto the IT department rather than being self-service for business users (a sketch of the kind of self-service query involved follows this list).

d) The business had a range of Business Intelligence tools, such as Business Objects, and these require an SQL database. It was noted that, at that point in time, there were no significant BI tools that worked with NoSQL solutions.

e) Internal perception. Some companies choose to be at the vanguard, or bleeding edge, of technology; others are the last to change or upgrade. This company favours the middle ground of choosing established technologies that offer a route to the future.
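To make point (c) concrete, the following is a minimal sketch of the kind of ad hoc query a business user might run directly against the archive with standard SQL tools. All view and column names here are illustrative assumptions, not taken from the client's actual schema (the views themselves are described under Solution Design below):

    -- Monthly call volume and cost to a given destination country.
    -- Names and the destination code are illustrative; syntax is generic SQL.
    SELECT DATEPART(month, call_start) AS call_month,
           COUNT(*)                    AS calls,
           SUM(cost)                   AS total_cost
    FROM   CDR
    WHERE  called_country = 'RO'
      AND  call_start >= '2012-01-01'
    GROUP BY DATEPART(month, call_start)
    ORDER BY call_month;

With a NoSQL store, a request like this would instead have become a development task for the IT department.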
2) If it is a database, then what type of database?

Once the Hadoop debate had been put to rest, the discussion turned to what type of database to use. Three options were considered.

a) Row store databases

A row store database is the type of database that most people will be familiar with. Each record is stored as a row in a table and queried using SQL. The organisation has a number of different row store database platforms for operational systems; however, the existing data warehouse is based on Sybase ASE, so this would be the natural choice. The primary benefit would be to continue using existing technology. However, at this scale the database would not necessarily be performant, and it would require significantly more disk space than the current raw files because both data and index space would be needed.

b) MPP-based data warehouse appliances

There are a number of massively parallel processing (MPP) architectures. These also store the data in rows, but they apply many CPUs, with data on disks close to each CPU, to speed up the problem. Whilst these can be very attractive solutions, they posed two specific problems for the project.

Firstly, it would not be cost effective to retain the existing data warehouse on the current platform and create an archive store on an MPP solution, because MPP solutions generally match CPU and disk together for performance. Therefore, to get high performance there would be a significant under-utilisation of the available disk resource.

Secondly (and as a consequence of the first issue), if the project were to move the existing data warehouse to an MPP platform, it would require porting or rewriting of the load processes (also known as ETL), which would add cost and time. This was exacerbated by the use of platform-specific user defined functions (UDFs) in the current data warehouse that are not available on the MPP platforms.

c) Column store databases

The third option was a column store database. To the user these look identical to a row-based database, but behind the scenes the data is actually stored in its columns, with a record being made up of pointers to the values held in the columns.

Column store databases consequently store only one copy of each value in a column. If you have a column in a table that contains the calling telephone number, then each telephone number will be repeated many times. Storing it only once and using a pointer to the string considerably reduces space, and because less space is used, less I/O is required to retrieve the data from disk, which also speeds up query times. Both a smaller disk footprint and faster query times make column store databases very attractive.
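As a rough worked example of that space saving (the figures are assumptions for illustration, not measurements from this project): suppose a CDR table holds one billion rows and its calling-number column contains two million distinct 15-character numbers. A row store repeats the string in every row, costing roughly 1,000,000,000 x 15 bytes, about 15 GB, for that column alone. A column store of the kind described above keeps each distinct value once (2,000,000 x 15 bytes, about 30 MB) plus a compact per-row pointer (say 4 bytes per row, about 4 GB), which is under 30% of the row store figure before any further compression is applied.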
Column store databases are considered more difficult to update than traditional row-based databases, because an update has to check the data in both the record map and the column store itself. This is not an issue for data warehousing and archiving solutions, where the data is usually loaded once and not updated.

Despite a lack of general awareness of column store databases, they have been around since the mid-1970s, and Wikipedia currently lists 21 commercial options and 8 open source solutions. The most established column store database in the market is Sybase IQ, which fitted nicely with the existing data warehouse on Sybase ASE.

Based on these choices, the client chose to go with a column store database. The next stage of the project was to build a prototype Sybase IQ store to understand what sort of performance and compression could be obtained.

Solution Design

The solution design was to create a simple schema with two types of tables, one for the CDR data and one for the IPDR data. It was also decided that the tables would be split into multiple datasets based on the start date of the call or data session. As a result we ended up with a number of tables of the form CDR_2012, CDR_2011, etc. and IPDR_2012, IPDR_2011, etc.

Views were created over these tables: a view called CDR over the CDR_XXXX tables and a view called IPDR over the IPDR_XXXX tables. A further view, DATA_RECORDS, was created across all CDR_XXXX and IPDR_XXXX tables, selecting only the fields common to both (start time, cost, record type, etc.). This was done for future ease of management, as it would be possible to remove an entire year of data simply by rebuilding the views to exclude that year and then dropping the table.
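The following sketch shows the shape of this design in generic SQL. It is illustrative only: the column names and types are assumptions (the real records carried 80 fields), and the exact DDL syntax for Sybase IQ differs in detail.

    -- One table per year and record type (column names assumed).
    CREATE TABLE CDR_2012 (
        call_start      DATETIME,
        calling_number  VARCHAR(20),
        called_number   VARCHAR(20),
        duration_secs   INT,
        cost            DECIMAL(12,4),
        record_type     CHAR(1)
        -- ...remaining fields of the 80-field record...
    );

    -- Union view giving a single queryable CDR set across all retained years.
    CREATE VIEW CDR AS
        SELECT * FROM CDR_2012
        UNION ALL
        SELECT * FROM CDR_2011;

    -- Combined view exposing only the fields common to CDRs and IPDRs.
    CREATE VIEW DATA_RECORDS AS
        SELECT call_start AS start_time, cost, record_type FROM CDR_2012
        UNION ALL
        SELECT call_start, cost, record_type FROM CDR_2011
        UNION ALL
        SELECT session_start, cost, record_type FROM IPDR_2012
        UNION ALL
        SELECT session_start, cost, record_type FROM IPDR_2011;

    -- Retiring a year: rebuild every view that references it, then drop
    -- its table (only the CDR view is shown here for brevity).
    DROP VIEW CDR;
    CREATE VIEW CDR AS SELECT * FROM CDR_2012;
    DROP TABLE CDR_2011;

Dropping a whole table is effectively instantaneous compared with deleting a year's worth of rows from a single large table, which is what makes this layout attractive for a rolling retention window.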
One year of call records consisted of around 500GB of raw data and one billion rows of CDRs. When loaded into the database this took up less than 100GB, achieving a compression ratio of 5.7:1. Similar results were obtained for the IPDRs. A 2TB system (4TB when mirrored) could therefore effectively hold seven years of history, even allowing for predicted growth.

The testing split the performance into two aspects: load time and query performance.

The CDRs are delivered from the host operator on an hourly schedule as fixed-format flat ASCII files, each typically containing around 35,000 records, where each record was 583 characters long and contained 80 fields. Loading an average file (typically 35,000 records in a 2MB file) took around one second, and since there are around 200 files per day this was not a burden on the system. It even allowed data to be loaded throughout the day, rather than in a single nightly batch, giving more immediate access to the CDRs.

Initially two years of data were loaded, although this will be allowed to grow over the coming years. The performance of user queries run directly on the Sybase IQ database varied dramatically depending on their type and complexity, but they were typically 10 to 15 times faster than the equivalent Sybase ASE queries. In addition, the queries were working on a 24-month data set rather than the original 3-month data set.

The existing data warehouse was left in situ, and using Sybase ASE together with Sybase IQ permitted the users to access both data sets from a single unified interface.

The aggregate load, held in the existing Sybase ASE data warehouse, was also modified to build from the Sybase IQ store rather than from the raw files. This reduced load complexity and allowed the IT team to be more responsive in delivering new aggregates.

In summary, a CDR/IPDR archive was created on a second server with directly attached storage, roughly equivalent in size and performance to the original data warehouse platform. It was accessible via SQL, allowed users access from within the current environment, compressed the data by more than five times and was typically ten times faster to query.

About the author

David Walker has been involved with Business Intelligence and Data Warehousing for over 20 years, first as a user, then with a software house and finally with a hardware vendor, before setting up his own consultancy firm Data Management & Warehousing (http://www.datamgmt.com) in 1995.

David and his team have worked around the world on projects designed to deliver the maximum benefit to the business by converting data into information and finding innovative ways in which to help businesses exploit that information.

David's project work has given him wide industry experience, including Telcos, Finance, Retail, Manufacturing, Transport and the Public Sector, as well as a broad and deep knowledge of Business Intelligence and Data Warehousing technologies.

David has written extensively on all aspects of data warehousing, from technical architecture and data model design through ETL and data quality to project management and governance.