On demand access to Big Data through Semantic Technologies


Published on

Presentation at Semantic Days
Stavanger, Norway, May 9, 2012

Published in: Technology, Business

On demand access to Big Data through Semantic Technologies

  1. 1. ON  DEMAND  ACCESS  TO  BIG  DATA  THROUGH  SEMANTIC  TECHNOLOGIES    Peter  Haase  fluid  Opera/ons  AG!
  2. 2. fluid  Opera/ons  (fluidOps)    Linked  Data  &  Seman;c  Technologies                                  Enterprise  Cloud  Compu;ng        So7ware  company  founded  Q1/2008  by  team  of  serial  entrepreneurs,  privately  held,  VC  funded  Headquarters  in  Walldorf  /  Germany,  SAP  Partner  Port  Currently  45  employees  Named  “Cool  Vendor”  by  Gartner  Mar  2010  Global  reseller  agreement  with  EMC  focus  large    enterprise  customers  Apr  2010  NetApp  Advantage  Alliance  Partner  Oct  2010  
  3. 3. Outline  •  Big  Data  Challenges:  Beyond  Volume  •  Seman;c  Technologies  for  Big  Data  Challenges  •  On  Demand  Data  Access  in  a  Self-­‐service  Process  •  Outlook:  Op;que  -­‐  Scalable  End  User  Access  to  Big  Data  
  4. 4. Big  Data                “Big  data  consists  of  data  sets  that  grow  so  large  that  they  become  awkward  to  work  with  using  on-­‐hand  database  management  tools.”    (Wikipedia)    •  12  terabytes  of  Tweets  created  daily  •  30  terabytes  of  telescope  data  each  night  •  350  billion  meter  readings    •  ...  
  5. 5. Op/que  Case  Study:  Statoil  Explora/on  Experts  in  geology  and  geophysics  develop  stra/graphic  models  of  unexplored  areas.  •  Based  on  produc/on  and  explora/on   data  from  nearby  loca/ons  •  Analy/cs  on:   •  1,000  TB  of  rela/onal  data   •  using  diverse  schemata   •  spread  over  2,000  tables   •  spread  over  mul/ple  individual  data  bases  •  900  experts  in  Statoil  Explora/on  •  up  to  4  days  for  new  data  access  queries  •  assistance  from  IT-­‐experts  required  
  6. 6. LOD  as  an  Example  of  Horizontal  Big  Data  
  7. 7. Semantic Technologies forHorizontal Big DataLinked  Data  •  Set  of  standards,  principles  for  publishing,  sharing  and  interrela/ng   structured  data:  RDF  as  data  model,  SPARQL  for  querying  •  Graph-­‐based  data  model  for  achieving  higher  degree  of  variety  •  Seman/cally  interlink  data  scabered  among  different  informa/on   spaces:  from  data  silos  to  a  Web  of  Data  Ontologies   •  For  describing  the  seman;cs  of  the  data     •  As  conceptual  models  for  end-­‐user  oriented  access   •  For  the  integra;on  of  heterogeneous  sources   •  For  (light-­‐weight)  reasoning     8
  8. 8. On  Demand  Access  to  Big  Data   Enabling  on  demand  data  access   1.  discovery  of  relevant  data  sources   2.  automated  integra/on  and  interlinking  of  sources,  and     3.  interac/ve  explora/on  and  ad  hoc  analysis  of  data   =>  Linked  Data  as  a  Service  
  9. 9. Everything  as  a  Service  •  Abstract  from  physical  implementa;on  details  and  loca;on   of  resources  •  Regardless  of  geographic  or  organiza/onal  separa;on  of   provider  and  consumer    •  “In  the  cloud”   Data as a Service•  Web  based  •  Virtualized   Software as a Service•  On-­‐demand  •  Self-­‐service   Platform as a Service•  Scalable  •  Pay  as  you  go   Infrastructure as a Service
  10. 10. Linked  Data  as  a  Service   “Like   all   members   of   the   "as   a   Service”   family,   DaaS   is   based   on   the   concept  that  the  product,  data  in  this  case,  can  be  provided  on  demand   to   the   user   regardless   of   geographic   or   organiza/onal   separa/on   of   provider  and  consumer.”   Source:  Wikipedia    •  Data  virtualiza;on  supported  by  Linked  Data  principles   1.  Use  URIs  as  names  for  things     2.  Use  HTTP  URIs  so  that  people  can  look  up  those  names.     3.  When  someone  looks  up  a  URI,  provide  useful  informa/on,  using  the  standards:  RDF,  SPARQL     4.  Include  links  to  other  URIs,  to  discover  more  things.  •  Linked  Data  as  abstrac/on  layer  for  virtualized  data  access  across  data  spaces  •  Enables  data  portability  across  current  data  silos    •  Plaform  independent  data  access  •  Basis  for  enabling  automa;on  of  discovery,  composi;on,  and  use  of  datasets   11  
  11. 11. Informa/on  Workbench  -­‐  Linked  Data  Plaform              Informa;on  Workbench:   §  Seman/cs-­‐  &  Linked  Data-­‐based   integra;on  of  private  and  public   data  sources   §  Intelligent  Data  Access  and   Analy;cs   §  Visual  Explora/on   §  Seman/c  Search   §  Dashboarding  and  Repor/ng   §  Collabora;on  and  knowledge   management  plaform   §  Wiki-­‐based  cura/on  &   authoring  of  data   Seman/c  Web  Data   §  Collabora/ve  workflows   12  
  12. 12. Enabling  Data  Discovery:  Metadata  about  Data  Sets  •  Metadata  about  data  sources  essen/al  for  dynamic  discovery  •  Based  on  metadata  vocabularies  (VoID,  DCAT)  •  Access  to  data  registered  at  global  registries,  e.g.  ckan.org,  data.gov,  …  •  Sort/filter  data  sets  by  topic,  license,  size  and  many  more  facets  to  iden/fy   relevant  data  •  Visually  explore  data  sets  
  13. 13. FedX  Federated  Query  Processing  1)  Involve  only  relevant  sources  in  the  evalua/on    Problem:  Subqueries  are  sent  to  all  sources,  although  poten/ally  irrelevant    2)  Compute  joins  close  to  the  data    Problem:  All  joins  are  executed  locally  in  a  nested  loop  fashion  3)  Reduce  remote  communica/on    Problem:  Nested  loop  join  causes  many  remote  requests   14  
  14. 14. Enabling  Data  Composi/on:  FederaAon  of  Virtualized  Data  Sources   Applicaon  Layer   Virtualizaon  Layer   Data  Layer   SPARQL   SPARQL   SPARQL   SPARQL   Endpoint   Endpoint   Endpoint   Endpoint   Metadata     Registry   Data  Source   Data  Source   Data  Source   Data  Source   See  also:  FedX:  Opmizaon  Techniques  for  Federated  Query  Processing  on  Linked  Data  (ISWC2011)    
  15. 15. Enabling  On  Demand  Use:  Self-­‐service  Linked  Data  Frontend  •  Seman/c  Wiki  as  user  frontend  •  Declara;ve  specifica;on  of  the  UI  based   on  available  pool  of  widgets  and   declara/ve  wiki-­‐based  syntax  •  Widgets  have  direct  access  to  the  DB  •  Type-­‐based  template  mechanism  •  Ad  hoc  data  explora/on,  visualiza/on,   analy/cs,  dashboards,  ...   Wiki  Page  in  Edit  Mode  …   …  and  Displayed  Result  Page  
  16. 16. Informa/on  Workbench  –  Linked  Data  as  a  Service  ApplicaAon  Areas  Knowledge  Management  in  the  Life  Sciences        Digital  Libraries,  Media  and  Content  Management        Intelligent  Data  Center  Management    
  17. 17. Example:  Linked  Data  in  Pharma   Main  Use  Cases     •  Integrate  data  from   company-­‐internal  Search,  Interrogate  and   Visualize,  Analyze  and   Capture  and  Augment   Reason   Explore   Knowledge   data  silos   •  Augment  company-­‐ Integrated  data  graph  over  all  data  sources   internal  data  with   Integ   Linked  Open  Data   •  Collabora/ve   knowledge   management   •  Support  of  internal   processes  (drug   development)    Private  Data  Sources   Public  Data  Sources  
  18. 18. Example:     Data  Center  Management  •  Support  collabora;ve   opera;ons  management   in  the  data  center   •  Link  business  data  to  technical  data   •  Technical  Documenta/on   •  Analy/cs  and  Repor/ng   •  Performance  and  Capacity  Monitoring   •  Responsibility  Management   •  Resource  Management   •  Change  Management   •  Technical  Ticke/ng  System   19  
  19. 19. Example:  A  Cloud  Portal  for  Access  to  Open  Data  with  the  Informa/on  Workbench  Goal   ...  using  the    •  Collect  meta  data  from  global  data  markets  (LOD   fluid  OperaAons   Cloud,  WorldBank,  CKAN,  …)  •  Allow  integrated  search  and  ad  hoc  integra/on  of  data   Technology  Stack   sources  from  different  repositories    •  Link  data  with  private/internal  data  sources,  if  desired    •  Support  semi-­‐automated  linking  between  data  sets    •  Provide  visualiza/on,  explora/on,  and  analy/cs     func/onality  on  top  of  integrated  data  sources          Realiza;on  •  Currently  running  project  with  the  Hasso  Plabner   Ins/tute  (Potsdam,  Germany)  •  Create  local  repository  containing  data  market   metadata  •  Use  self-­‐service  technology  to  make  services  publicly   available  +  Informa/on  Workbench  for  analy/cs  
  20. 20. Informa/on  Workbench:  Linked  Data  as  a  Service  in  a  Cloud  Plaform  Architecture   Applicaon  Layer  (SaaS)   Provisioning,  Monitoring    and    Management   Virtualizaon  Layer   Infrastructure  Layer  (IaaS)   Data  Layer  (DaaS)   Netw.-­‐Ab.  Storage   Network   Compu/ng  Resources     Enterprise  Data  Sources Open  Data  Sources  
  21. 21. Provisioning,  Monitoring    and    Management   Applicaon  Layer  (SaaS)   Virtualizaon  Layer   Infrastructure  Layer  (IaaS)   Data  Layer  (DaaS)   Netw.-­‐Ab.  Storage   Network   Compu/ng  Resources     Enterprise  Data  Sources Open  Data  Sources   Self-­‐service   Data  Integra/on   Self-­‐service  UI  &   Data  Discovery   Deployment   &  Federa/on   Analy/cs  •  Self-­‐service  deployment   •  On  demand  access  to   •  Virtualized  data   •  Living  UI,  composed   of  the  InformaAon   private  and  public   access   from  seman/cs-­‐aware   Workbench  in  the  cloud   data  sources   •  Dynamic  integra/on  &   widgets  •  Pay-­‐per-­‐use   •  Dynamic  Discovery   federa/on  of  data   •  Ad  hoc  data  •  Scalability  on  demand   sources   explora/on,   visualiza/on,  analy/cs  
  22. 22. Op/que  Components  
  23. 23. Op/que  Real-­‐/me  stream   Query  Evalua/on   processing   with  Elas/c  Clouds   Scalable  Query   End-­‐user-­‐oriented   Rewri/ng   Query  Interfaces  
  24. 24. Summary  •  Big  Data  means  more  than  volume  and  ver/cal  scale  •  Seman/c  Technologies  for  Big  Data  management   •  Linked  Data  as  adequate  data  model   •  Ontologies  as  conceptual  models  to  access  big  data   •  Integra/on  of  diverse,  heterogeneus  data  sources  •  Linked  Data  as  a  Service  for  enabling  on  demand  data  access   1.  discovery  of  relevant  data  sources   2.  automated  integra/on  and  interlinking  of  sources,  and     3.  interac/ve  explora/on  and  ad  hoc  analysis  of  data  •  Outlook:  Op/que  –  Scalable  end-­‐user  access  to  Big  Data  
  25. 25. CONTACT:  fluid  Opera;ons  Altrostr.  31  Walldorf,  Germany    Email:  peter.haase@fluidops.com    website:  www.fluidops.com  Tel.:  +49  6227  3846-­‐527