Learn with WSO2 - Building your Big Data Solution
Srinath Perera
Director of Research
WSO2 Inc.
  
About WSO2
•  Providing the only complete open source componentized
cloud platform
–  Dedicated to removing all the stumbling blocks to enterprise agility
–  Enabling you to focus on business logic and business value
•  Recognized by leading analyst firms as visionaries and
leaders
–  Gartner cites WSO2 as visionaries in all 3 categories of
application infrastructure
–  Forrester places WSO2 in top 2 for API Management
•  Global corporation with offices in USA, UK & Sri Lanka
–  200+ employees and growing
•  Business model of selling comprehensive support &
maintenance for our products
150+ globally positioned support customers
Consider a day in your life
•  What is the best road to take?
•  Would there be any bad weather?
•  What is the best way to invest the money?
•  Should I take that loan?
•  Can I optimize my day?
•  Is there a way to do this faster?
•  What have others done in similar cases?
•  Which product should I buy?
  
What people have always wanted (through the ages)
•  To know (what happened?)
•  To explain (why did it happen?)
•  To predict (what will happen?)
  
What is Big Data?
•  There is a lot of data available
–  E.g. the Internet of Things
•  We have the computing power
•  We have the technology
•  The goal is the same
–  To know
–  To explain
–  To predict
•  The challenge is handling the full lifecycle
  
Drivers of Big Data
  
Data Avalanche / Moore's Law of Data
•  We are now collecting large amounts of data and converting it to digital form
•  90% of the data in the world today was created within the past two years.
•  The amount of data we have doubles very quickly
  
In real life, most data is Big
•  The Web sees millions of activities per second, so a large volume of server logs is created.
•  Social networks: e.g. Facebook has 800 million active users and 40 billion photos from its user base.
•  There are >4 billion phones, and >25% of them are smartphones. There are billions of RFID tags.
•  Observational and sensor data
–  Weather radars, balloons
–  Environmental sensors
–  Telescopes
–  Complex physics simulations
  
Why is Big Data hard?
•  How to store? Assuming 1TB of storage per computer, it takes 1000 computers to store 1PB
•  How to move? Assuming a 10Gb/s network running at full line rate, it takes about 13 minutes to copy 1TB, or about 9 days to copy 1PB
•  How to search? Assuming each record is 1KB and one machine can process 1000 records per second, it needs about 278 CPU hours (roughly 11.6 CPU days) to process 1TB and about 32 CPU years to process 1PB
•  How to process?
–  How to convert algorithms to work at large scale
–  How to create new algorithms
http://www.susanica.com/photo/9
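Estimates of this kind are simple divisions. The following is a minimal sketch of the arithmetic, assuming decimal units (1TB = 10^12 bytes, 1KB = 1000 bytes) and a link or CPU running at full rate with no protocol or disk overhead. The function names are illustrative and not part of any WSO2 tool; real transfers and scans will be slower than these idealized lower bounds.

```python
import math

TB = 10**12  # decimal terabyte, in bytes (assumption)
PB = 10**15  # decimal petabyte, in bytes (assumption)


def computers_needed(total_bytes, bytes_per_computer=TB):
    """Machines needed to hold the data, assuming a fixed capacity each."""
    return math.ceil(total_bytes / bytes_per_computer)


def copy_seconds(total_bytes, link_bits_per_second):
    """Ideal transfer time over a link used at full rate, with no overhead."""
    return total_bytes * 8 / link_bits_per_second


def scan_seconds(total_bytes, record_bytes=1000, records_per_second=1000):
    """Time for one machine to process every record exactly once."""
    return total_bytes / record_bytes / records_per_second


if __name__ == "__main__":
    for name, size in (("1TB", TB), ("1PB", PB)):
        print(name,
              "| computers:", computers_needed(size),
              "| copy hours at 10Gb/s:", copy_seconds(size, 10e9) / 3600,
              "| scan days on one CPU:", scan_seconds(size) / 86400)
```

Changing the link speed or the per-machine processing rate shows how sensitive the totals are to each assumption.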
  
Why is it hard (Contd.)?
•  The system is built of many computers
•  That handle lots of data
•  And run complex logic
•  This pushes us to the frontier of Distributed Systems and Databases
•  More data does not mean there is a simple model
•  Some models can be as complex as the system
http://www.flickr.com/photos/mariachily/5250487136, Licensed CC
  
Big Data Architecture
  
WSO2 Offerings
•  Two tools
–  WSO2 BAM for storing and processing
–  WSO2 CEP for real-time processing
•  These tools cover the whole processing life cycle for your Big Data, with the help of a few other products as needed.
–  WSO2 Storage Server
–  WSO2 User Experience Server
  
Big Data Architecture Implementation
  
Sensors
•  Built-in sensors in WSO2 products
•  Event logs
–  Click streams, emails, chat, search, tweets, transactions …
•  Custom sensors
–  Video surveillance, cash flows, traffic, smart grid, production line, RFID (e.g. Walmart), GPS sensors, mobile phones, Internet of Things
http://www.flickr.com/photos/imuttoo/4257813689/ by Ian Muttoo, http://www.flickr.com/photos/eastcapital/4554220770/, http://www.flickr.com/photos/patdavid/4619331472/ by Pat David, copyright CC
  
Collecting Data
•  Data is collected at sensors and sent to the big data system via events or flat files
•  Event streams: we name events by their content/originator
•  Get data through
–  Point to point
–  Event bus
•  E.g. Data Bridge
–  A Thrift-based transport we built that handles about 400k events/sec
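The difference between point-to-point delivery and an event bus can be sketched in a few lines. This is a minimal in-process illustration only, assuming named event streams and plain callables as receivers; the `EventBus` class and the stream name are hypothetical and do not reflect the Data Bridge API or its Thrift transport.

```python
from collections import defaultdict


class EventBus:
    """Routes events to every handler subscribed to a named stream."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, stream, handler):
        """Register a callable that receives each event on the stream."""
        self._subscribers[stream].append(handler)

    def publish(self, stream, event):
        """Deliver the event to all subscribers; return how many got it."""
        handlers = self._subscribers[stream]
        for handler in handlers:
            handler(event)
        return len(handlers)


if __name__ == "__main__":
    bus = EventBus()
    stored, alerts = [], []
    # Two independent consumers of the same (hypothetical) stream.
    bus.subscribe("click_stream", stored.append)
    bus.subscribe("click_stream", alerts.append)
    bus.publish("click_stream", {"page": "/home", "user": "u1"})
    print(len(stored), len(alerts))
```

With point-to-point delivery the sensor would call one receiver directly; with a bus, the sender only knows the stream name, so consumers can be added without changing the sensor.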
  
Storing Data
•  Historically we used databases
–  Scale is a challenge: replication, sharding
•  Scalable options
–  NoSQL (Cassandra, HBase) [if data is structured]
•  Column families are gaining ground
–  Distributed file systems (e.g. HDFS) [if data is unstructured]
•  NewSQL
–  In-memory computing, VoltDB
•  Specialized data structures
–  Graph databases, data structure servers
http://www.flickr.com/photos/keso/363133967/
  
Storing Data (Contd.)
•  WSO2 offerings (WSO2 Storage Server)
–  Small structured data: keep in relational databases
–  Large structured data: Cassandra
–  Large unstructured data: HDFS
  
Making Sense of Data
•  To know (what happened?)
–  Basic analytics + visualizations (min, max, average, histogram, distributions …)
–  Interactive drill down
•  To explain (why)
–  Data mining, classification, building models, clustering
•  To forecast
–  Neural networks, decision models
  
Making Sense of Data (Contd.)
•  Batch processing – WSO2 BAM
–  Hive scripts
–  MapReduce jobs
•  Real-time processing – CEP
–  Event Query Language
•  These two are the platform; you still need to program your use case on top of them.
  
To know (what happened?)
•  Mainly analytics
–  Min, max, average, correlation, histograms
–  Might join or group data in many ways
•  Implemented with MapReduce or queries
•  Data is often presented with some visualizations
•  Examples
–  Forensics
–  Assessments
–  Historical data/reports/trends
http://www.flickr.com/photos/isriya/2967310333/
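As a small illustration of the basic analytics listed here, the sketch below computes min, max, average and a fixed-width histogram over a list of numbers. It assumes the data fits in memory; the function name and bucket width are illustrative choices, and at Big Data scale the same aggregates would be computed with MapReduce or queries instead.

```python
from collections import Counter
from statistics import mean


def summarize(values, bucket_width):
    """Return min, max, average and a histogram keyed by bucket start."""
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "average": mean(values),
        "histogram": dict(sorted(histogram.items())),
    }


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    print(summarize([120, 80, 250, 95, 130], bucket_width=100))
```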
  
To Explain (Patterns)
•  Correlation
–  Scatter plot, statistical correlation
•  Data mining (detecting patterns)
–  Clustering and classification
–  Finding similar items
–  Finding hubs and authorities in a graph
–  Finding frequent item sets
–  Making recommendations
•  Apache Mahout
http://www.flickr.com/photos/eriwst/2987739376/ and http://www.flickr.com/photos/focx/5035444779/
  
To Predict: Forecasts and Models
•  Trying to build a model for the data
•  Theoretically or empirically
–  Analytical models (e.g. physics)
–  Neural networks
–  Reinforcement learning
–  Unsupervised learning (clustering, dimensionality reduction, kernel methods)
•  Examples
–  Translation
–  Weather forecast models
–  Building profiles of users
–  Traffic models
–  Economic models
•  A lot of domain-specific work
http://misterbijou.blogspot.com/2010_09_01_archive.html
  
Information Visualization
•  Presenting information
–  To end users
–  To decision makers
–  To scientists
•  Interactive exploration
•  Sending alerts
•  WSO2 UES
–  Jaggery based
•  BAM/CEP can work with most other UI tools
http://www.flickr.com/photos/stevefaeembra/3604686097/
  
WSO2 UES
•  Dashboards and Store
•  Build your own UIs with Jaggery
  
MapReduce / Hadoop
•  First introduced by Google and used as the processing model for their architecture
•  Implemented by open source projects like Apache Hadoop and Spark
•  Users write two functions: map and reduce
•  The framework handles details such as distributed processing, fault tolerance and load balancing
•  Widely used, and one of the catalysts of Big Data
  
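// Word count (pseudocode): map emits (word, 1) for every token in the input value
void map(ctx, k, v){
    tokens = v.split();
    for t in tokens
        ctx.emit(t, 1);
}
// reduce receives all the counts emitted for one word and sums them
void reduce(ctx, k, values[]){
    count = 0;
    for v in values
        count = count + v;
    ctx.emit(k, count);
}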
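The pseudocode above can be tried out without a cluster. The following is a minimal single-process sketch of the same word count in Python; `run_mapreduce` is a hypothetical stand-in for what a framework such as Hadoop does (grouping map output by key before calling reduce) and leaves out distribution, fault tolerance and load balancing entirely.

```python
from collections import defaultdict


def map_fn(key, value):
    """Emit (word, 1) for every whitespace-separated token in the line."""
    for token in value.split():
        yield token, 1


def reduce_fn(key, values):
    """Sum all the counts emitted for one word."""
    yield key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: run map, group the output by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = {}
    for out_key in sorted(groups):
        for final_key, final_value in reduce_fn(out_key, groups[out_key]):
            results[final_key] = final_value
    return results


if __name__ == "__main__":
    lines = [(0, "big data is big"), (1, "data is everywhere")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

The user-written part is only `map_fn` and `reduce_fn`; everything in `run_mapreduce` is what the framework would take over.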
MapReduce (Contd.)
  
Data on the Move
•  The idea is to process data as it is received, in a streaming fashion
•  Used when we need
–  Very fast output
–  To handle lots of events (a few 100k to millions)
–  Processing without storing (e.g. too much data)
•  Two main technologies
–  Stream processing (e.g. Storm, http://storm-project.net/)
–  Complex Event Processing (CEP), http://wso2.com/products/complex-event-processor/
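"Processing without storing" means each event updates a small amount of state and is then discarded. A minimal sketch of that idea, assuming numeric events arriving one at a time; the `RunningStats` class is illustrative and is not how Storm or WSO2 CEP implement their operators.

```python
class RunningStats:
    """Keeps count, mean, min and max in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.minimum = None
        self.maximum = None

    def update(self, value):
        """Fold one event into the running aggregates, then forget it."""
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


if __name__ == "__main__":
    stats = RunningStats()
    for speed in (10, 20, 30):  # hypothetical sensor readings
        stats.update(speed)
    print(stats.count, stats.mean, stats.minimum, stats.maximum)
```

Memory use stays constant regardless of how many events pass through, which is the property that matters when the stream is too large to store.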
  	
  
Complex Event Processing (CEP)
•  Sees inputs as event streams, which are queried with an SQL-like language
•  Supports filters, windows, joins, patterns and sequences
  
from p=PINChangeEvents#win.time(3600) join
t=TransactionEvents[p.custid=custid][amount>10000]
#win.time(3600)
return t.custid, t.amount;
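To make the intent of this query concrete, here is a minimal Python sketch of the same windowed join: report transactions above 10000 for a customer who also changed their PIN within the window. It assumes the 3600 in `win.time` is in seconds, that events arrive in timestamp order, and that each event is a `(timestamp, kind, custid, amount)` tuple. The tuple layout, the `kind` labels and the function name are illustrative assumptions, not part of WSO2 CEP.

```python
from collections import deque

WINDOW_SECONDS = 3600   # assumed unit for win.time(3600)
AMOUNT_THRESHOLD = 10000


def suspicious_transactions(events):
    """Yield (custid, amount) when a large transaction and a PIN change
    for the same customer fall inside the same time window."""
    pin_changes = deque()    # (timestamp, custid)
    transactions = deque()   # (timestamp, custid, amount), already filtered
    for ts, kind, custid, amount in events:
        # Expire entries that have slid out of either window.
        while pin_changes and pin_changes[0][0] < ts - WINDOW_SECONDS:
            pin_changes.popleft()
        while transactions and transactions[0][0] < ts - WINDOW_SECONDS:
            transactions.popleft()
        if kind == "pin_change":
            pin_changes.append((ts, custid))
            for _, t_cust, t_amount in transactions:
                if t_cust == custid:
                    yield t_cust, t_amount
        elif kind == "transaction" and amount > AMOUNT_THRESHOLD:
            transactions.append((ts, custid, amount))
            for _, p_cust in pin_changes:
                if p_cust == custid:
                    yield custid, amount


if __name__ == "__main__":
    stream = [
        (0, "pin_change", "c1", None),
        (100, "transaction", "c1", 25000),
        (200, "transaction", "c2", 50000),
        (5000, "transaction", "c1", 30000),
    ]
    print(list(suspicious_transactions(stream)))
```

A CEP engine expresses the window expiry and the join declaratively, which is why the query is only four lines.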
Summary	
  	
  
Case Study 1: Tracing a Business Process
•  A business process is built using many services
•  Track and trace each step, and analyze the results to understand how to optimize
•  E.g. a sales pipeline
  
Some Queries
•  What is the conversion rate?
•  How many deals are in the pipeline each month?
•  What is the average size of the deals?
•  How long does a deal take on average?
•  Can we identify large deals early?
•  Which is better: going for a few large ones or many small ones?
•  Were there any delays on our side?
  
Hive: Average Size of the Deal
•  Hive uses an SQL-like syntax.
•  Easy to understand and learn
  
hive> LOAD DATA ..
hive> SELECT month, avg(value) FROM LEAD_ACTIVITY
WHERE action = 'closedWon' GROUP BY month;
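For readers less familiar with SQL, the query filters rows to closed-won deals, groups them by month and averages the deal value in each group. A minimal Python sketch of the same logic, assuming each row is a dict with the `month`, `action` and `value` fields used in the query; the sample rows are made up for illustration.

```python
from collections import defaultdict


def average_deal_size_by_month(rows):
    """Average 'value' of rows with action == 'closedWon', per month."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum, count]
    for row in rows:
        if row["action"] == "closedWon":
            acc = totals[row["month"]]
            acc[0] += row["value"]
            acc[1] += 1
    return {month: total / n for month, (total, n) in sorted(totals.items())}


if __name__ == "__main__":
    lead_activity = [  # hypothetical rows
        {"month": 1, "action": "closedWon", "value": 1000},
        {"month": 1, "action": "closedWon", "value": 3000},
        {"month": 1, "action": "open", "value": 9999},
        {"month": 2, "action": "closedWon", "value": 500},
    ]
    print(average_deal_size_by_month(lead_activity))
```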
MapReduce: How many deals in the pipeline?
  
How many deals in the pipeline? (Contd.)
  
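// map (pseudocode): parse one deal record and emit (month, 1)
void map(ctx, k, v){
    Deals deal = parse(v);
    int month = getMonth(deal.time);
    ctx.emit(month, 1);
}
// reduce: sum the counts emitted for one month
void reduce(ctx, k, values[]){
    count = 0;
    for v in values
        count = count + v;
    ctx.emit(k, count);
}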
Case Study 2: DEBS Challenge
•  Event processing challenge
•  Real football game, with sensors in the players' shoes and in the ball
•  Events arrive at 15 kHz
•  Event format
–  Sensor ID, TS, x, y, z, v, a
•  Queries
–  Running stats
–  Ball possession
–  Heat map of activity
–  Shots at goal
  
Example: Detect Ball Possession
•  Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground
  
from Ball#window.length(1) as b join
     Players#window.length(1) as p
     unidirectional
     on debs:getDistance(b.x, b.y, b.z,
                         p.x, p.y, p.z) < 1000
        and b.a > 55
select ...
insert into hitStream

from old = hitStream,
     b = hitStream[old.pid != pid],
     n = hitStream[b.pid == pid]*,
     ( e1 = hitStream[b.pid != pid]
       or e2 = ballLeavingHitStream )
select ...
insert into BallPossessionStream
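The second query is a sequence pattern: a hit by a new player, any number of further hits by the same player, then either a hit by someone else or the ball leaving. A minimal Python sketch of that logic, assuming hits have already been detected and arrive in time order as `(timestamp, kind, player_id)` tuples; the tuple layout and `kind` labels are illustrative. As a simplification, possession here starts at a player's first hit without requiring an earlier hit by another player.

```python
def possession_intervals(events):
    """Yield (player_id, start_ts, end_ts) for each finished possession.

    events: iterable of (timestamp, kind, player_id) in time order,
    where kind is 'hit' or 'ball_left' (player_id is ignored for the latter).
    """
    holder = None
    start = None
    for ts, kind, pid in events:
        if kind == "hit":
            if holder is None:
                holder, start = pid, ts
            elif pid != holder:
                # Someone else hit the ball: close the previous possession.
                yield holder, start, ts
                holder, start = pid, ts
            # A repeated hit by the same player extends the possession.
        elif kind == "ball_left" and holder is not None:
            yield holder, start, ts
            holder, start = None, None


if __name__ == "__main__":
    hits = [
        (0, "hit", "A"), (2, "hit", "A"), (5, "hit", "B"),
        (9, "ball_left", None), (12, "hit", "A"),
    ]
    print(list(possession_intervals(hits)))
```

A possession that is still open when the stream ends is not emitted, mirroring the pattern's need for a closing event.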
http://www.flickr.com/photos/glennharper/146164820/
  
Conclusions
•  What is Big Data?
•  Big Data architecture
–  Collecting data
–  Storing data
–  Processing data
•  WSO2 offerings
•  Case studies
  
Questions?
  
Engage with WSO2
•  Helping you get the most out of your deployments
•  From project evaluation and inception to development
and going into production, WSO2 is your partner in
ensuring 100% project success