Introduction to Big Data
Dr. Putchong Uthayopas
Department of Computer Engineering,
Faculty of Engineering, Kasetsart University
Email: putchong@ku.th
We are living in the world of Data
Geophysical Exploration
Medical Imaging
Video Surveillance
Mobile Sensors
Gene Sequencing
Smart Grids
Social Media
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
Source: Gartner Inc.
Why Big Data?
• Improve products and services
• Increase customer satisfaction/behavior
• Improve operational efficiency
• Understand emerging market trends
"The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence."
"Know thyself, know thy enemy. A thousand battles, a thousand victories."
(http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf)
Source: The Field Guide to Data Science
Big Data vs. Business Intelligence vs. Analytics
• BI software and technology
  – Well-structured data from a data warehouse
  – Visual representation of data to gain insight into the data
  – Some predictive capability, such as statistical analysis and data mining
• Big Data
  – Focuses on analysis of huge, unstructured data sets to gain insight automatically
Properties of Big Data
[Diagram: Big Data = Volume, Velocity, Variety]
Volume
• Big data must be huge
  – Beyond the capability of a single computer server to process
  – Possible to store the data but difficult to process it
Velocity
• Big data accumulates at a very fast speed
  – Stock market data
  – Internet access logs
  – Social media data
    • Twitter, Facebook, IG
• We need to
  – Extract meaning as fast and as much as we can before throwing away the data
Variety
• Data comes in many varieties
  – Traditional databases
  – Documents
  – Web pages
  – Social media data
  – Images
  – Video/Audio
  – Location
Diya Soubra, The 3Vs that define Big Data, 2012
http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
Considerations for Applying Big Data
http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/
BRIEF OVERVIEW OF BIG DATA TOOLS
Big Data Ecosystem
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
Big Data Ecosystem - Infrastructure
• Hadoop
  – Technologies designed for storing, processing and analysing data by breaking up and distributing data into parts and analysing those parts concurrently, rather than tackling one monolithic block of data all in one go.
• NoSQL
  – Stands for Not Only SQL
  – Involved in processing large volumes of multi-structured data. Most NoSQL databases are most adept at handling discrete data stored among multi-structured data.
• Massively Parallel Processing (MPP) Databases
  – MPP databases work by segmenting data across multiple nodes and processing these segments of data in parallel, using SQL.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
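The MPP idea above (segment the data across nodes, process the segments in parallel, answer a SQL-style query) can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: threads stand in for database nodes, and the `sales` rows, the function names and the hash-based segmentation are hypothetical choices made for illustration, not how any particular MPP product works.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def segment(rows, num_nodes):
    """Distribute (region, amount) rows across nodes by hashing the region."""
    nodes = [[] for _ in range(num_nodes)]
    for region, amount in rows:
        nodes[hash(region) % num_nodes].append((region, amount))
    return nodes


def local_group_sum(rows):
    """Work done on one node: a partial SUM(amount) per region."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def parallel_group_sum(rows, num_nodes=3):
    """Toy version of: SELECT region, SUM(amount) FROM sales GROUP BY region."""
    nodes = segment(rows, num_nodes)
    # Threads stand in for the separate nodes of an MPP database (assumption).
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_group_sum, nodes))
    # The coordinator merges the partial results from every node.
    merged = defaultdict(float)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


if __name__ == "__main__":
    sales = [("north", 10.0), ("south", 5.0), ("north", 2.5), ("east", 1.0)]
    print(parallel_group_sum(sales))
```

Each node only ever sees its own segment, so adding nodes spreads both the storage and the work.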
  
Big Data Ecosystem - Analytics
• Analytics Platforms
  – Integrate and analyse data to uncover new insights, and help companies make better-informed decisions.
• Visualization Platforms
  – Visualize data: take the raw data and present it in complex, multi-dimensional visual formats to illuminate the information.
• Business Intelligence (BI) Platforms
  – Analyze data from multiple sources to deliver services such as business intelligence reports, dashboards and visualizations.
• Machine Learning
  – In machine learning, the input is the data the algorithm 'learns from', and the output depends on the use case. One of the most famous examples is IBM's supercomputer Watson, which has 'learned' to scan vast amounts of information to find specific answers, and can comb through 200 million pages of structured and unstructured data in minutes.
Reference: http://dataconomy.com/understanding-big-data-ecosystem/
How can we store and process massive data?
• Beyond the capability of a single server
• Basic infrastructure
  – Cluster of servers
  – High-speed interconnect
  – High-speed storage cluster
• Incoming data is spread across the server farm
• Processing is quickly distributed to the farm
• Results are collected and sent back
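The three steps above (spread the data, distribute the processing, collect the results) can be sketched in a few lines of Python. This is a minimal illustration, not a real cluster: a thread pool plays the role of the server farm, and the log lines, `spread`, `count_errors` and `process_on_farm` are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def spread(records, num_servers):
    """Spread incoming records across the servers in round-robin order."""
    shards = [[] for _ in range(num_servers)]
    for i, record in enumerate(records):
        shards[i % num_servers].append(record)
    return shards


def count_errors(shard):
    """Work done locally on one server: count log lines that contain ERROR."""
    return sum(1 for line in shard if "ERROR" in line)


def process_on_farm(records, num_servers=4):
    shards = spread(records, num_servers)
    # Each worker thread stands in for one server of the farm (assumption).
    with ThreadPoolExecutor(max_workers=num_servers) as pool:
        partial_counts = list(pool.map(count_errors, shards))
    # Collect the partial results and send back a single answer.
    return sum(partial_counts)


if __name__ == "__main__":
    log = ["GET /a OK", "GET /b ERROR", "GET /c OK", "GET /d ERROR", "GET /e ERROR"]
    print(process_on_farm(log, num_servers=2))
```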
  
NoSQL (Not Only SQL)
• A NoSQL (often interpreted as "Not only SQL") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
  – Non-relational, distributed, open-source and horizontally scalable
  – Used to handle huge amounts of data
  – The original intention was modern web-scale databases
Reference: http://nosql-database.org/
MongoDB
• MongoDB is a general-purpose, open-source database.
• MongoDB features:
  – Document data model with dynamic schemas
  – Full, flexible index support and rich queries
  – Auto-sharding for horizontal scalability
  – Built-in replication for high availability
  – Text search
  – Advanced security
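To make "document data model with dynamic schemas" and "sharding for horizontal scalability" more concrete, here is a small pure-Python sketch. It does not use MongoDB or its drivers and makes no claim about MongoDB's internals: `ShardedCollection`, `shard_for` and the hash-based routing are hypothetical, chosen only to show the idea that documents with different fields can live in one collection whose contents are split across several shards.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedCollection:
    """Toy document store: every shard is a dict of {_id: document}."""

    def __init__(self, num_shards=3):
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, doc):
        # Documents are plain dicts, so each one may carry different fields.
        shard = self.shards[shard_for(doc["_id"], len(self.shards))]
        shard[doc["_id"]] = doc

    def find_by_id(self, doc_id):
        # A lookup by shard key touches exactly one shard.
        shard = self.shards[shard_for(doc_id, len(self.shards))]
        return shard.get(doc_id)

    def find(self, **criteria):
        # A query on other fields has to ask every shard.
        return [
            doc
            for shard in self.shards
            for doc in shard.values()
            if all(doc.get(k) == v for k, v in criteria.items())
        ]


if __name__ == "__main__":
    users = ShardedCollection()
    users.insert({"_id": 1, "name": "Ann", "city": "Bangkok"})
    users.insert({"_id": 2, "name": "Bob", "tags": ["admin"]})
    print(users.find_by_id(2))
    print(users.find(city="Bangkok"))
```

The two inserted documents have different fields, which is what "dynamic schema" means in practice; a relational table would need both columns declared up front.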
  
Hadoop
• Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
• The base Apache Hadoop framework is composed of the following modules:
  – Hadoop Common: contains libraries and utilities needed by other Hadoop modules;
  – Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  – Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications; and
  – Hadoop MapReduce: a programming model for large-scale data processing.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
Magic behind Hadoop and HDFS
• The problem is divided into two phases
  – Map: apply some action to data in <key, value> pairs and get intermediate results
  – Reduce: summarize the intermediate <key, value> results and return them to the main program
Ricky Ho, How Hadoop Map/Reduce works,
http://architects.dzone.com/articles/how-hadoop-mapreduce-works
Example: Word count
• Counting words in an input text file
  – How many times does the word "love" appear in a novel? ^_^
• In the map phase the sentence is split into words, which form the initial key-value pairs <word, 1>
  – "tring tring the phone rings" becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>
• In the reduce phase the keys are grouped together and the values for identical keys are added
  – There is only one pair of identical keys, 'tring'; the values for these keys are added, so the output key-value pairs are
  – <tring,2>, <the,1>, <phone,1>, <rings,1>
• Reduce forms an aggregation phase for keys
  – This gives the number of occurrences of each word in the input
http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
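The word count walk-through above can be reproduced in plain Python. This is a single-process sketch of the map, group and reduce steps, not Hadoop code; the function names are made up for illustration, and a real Hadoop job would run many mappers and reducers on different machines.

```python
from collections import defaultdict


def map_phase(sentence):
    """Split the sentence into words and emit a <word, 1> pair for each."""
    return [(word, 1) for word in sentence.split()]


def group_by_key(pairs):
    """Group the intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Add up the values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(sentence):
    return reduce_phase(group_by_key(map_phase(sentence)))


if __name__ == "__main__":
    # Prints {'tring': 2, 'the': 1, 'phone': 1, 'rings': 1}
    print(word_count("tring tring the phone rings"))
```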
  
In-memory Database
• An in-memory database is
  – a database management system that primarily relies on main memory for computer data storage.
  – faster than disk-optimized databases, since the internal optimization algorithms are simpler and execute fewer CPU instructions.
  – Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
Source: http://en.wikipedia.org/wiki/In-memory_database
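A quick way to try the idea is Python's built-in `sqlite3` module, which keeps the whole database in RAM when it is opened with the special name `":memory:"`. SQLite is a general-purpose embedded database rather than a dedicated in-memory DBMS, so this is only a small-scale illustration; the `trades` table and the helper names are hypothetical.

```python
import sqlite3


def build_in_memory_db(rows):
    """Create a database that lives only in main memory and load rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
    conn.executemany("INSERT INTO trades VALUES (?, ?)", rows)
    conn.commit()
    return conn


def average_price(conn, symbol):
    """Return the average price for a symbol, or None if it has no rows."""
    cur = conn.execute(
        "SELECT AVG(price) FROM trades WHERE symbol = ?", (symbol,)
    )
    return cur.fetchone()[0]


if __name__ == "__main__":
    conn = build_in_memory_db([("AAA", 10.0), ("AAA", 20.0), ("BBB", 5.0)])
    print(average_price(conn, "AAA"))
    conn.close()
```

Nothing is written to disk here, so the data disappears when the connection is closed; production in-memory databases add persistence and replication on top of this basic idea.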
  
What is Spark?
Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop
Efficient
• General execution graphs
• In-memory storage
• Up to 10× faster on disk, 100× in memory
Usable
• Rich APIs in Java, Scala, Python
• Interactive shell
• 2-5× less code
The Spark Community
Spark at Yahoo
• Two use cases: one for personalizing news pages for Web visitors and another for running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.
  – Yahoo wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)
  – With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
• The second use case shows off the interactive capability of Hive on Spark (Shark).
  – Use existing BI tools to view and query the advertising analytics data collected in Hadoop.
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
Big Data Goes to Cloud
• Data is already on the cloud
  – Virtual organization
  – Cloud-based SaaS services
• Big Data as a Service on the cloud
  – Private cloud
  – Public cloud
Amazon
• Amazon EC2
  – Computation service using VMs
• Amazon DynamoDB
  – Large, scalable NoSQL database
  – Fully distributed shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR)
  – Hadoop-based analysis engine
  – Can be used to analyse big data without the need to build the infrastructure
http://aws.amazon.com/big-data/
Google Cloud Platform
• App Engine
  – Mobile and web apps
• Cloud SQL
  – MySQL on the cloud
• Cloud Storage
  – Data storage
• BigQuery
  – Data analysis
• Google Compute Engine
  – Processing of large data
BIG DATA BENEFITS AND USE CASES
Current Trends
• Big data is moving toward real usage
  – From pilot to real usage
• More software solutions
  – Infrastructure
  – Analytics
    • Statistical analysis
    • Social graph analysis
• More unstructured data
  – Facebook, Twitter, text, video, image
[Diagram labels: Analytics, Structured, Unstructured, Big Data]
Google Flu
• A pattern emerges when all the flu-related search queries are added together.
• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
http://www.google.org/flutrends/about/how.html
WHAT FACEBOOK KNOWS
http://www.facebook.com/data
Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.
Study of Human Society
• Facebook, in collaboration with the University of Milan, conducted an experiment that involved
  – the entire social network as of May 2011
  – more than 10 percent of the world's population
• Analyzing the 69 billion friend connections among those 721 million people showed that
  – four intermediary friends are usually enough to introduce anyone to a random stranger.
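The "intermediary friends" measurement is a shortest-path question on the friendship graph. The sketch below shows the core idea with a breadth-first search on a tiny made-up graph; the names and the six-person chain are purely illustrative, and the real study needed distributed algorithms to cope with 69 billion connections.

```python
from collections import deque


def hops_between(graph, start, goal):
    """Return the smallest number of friendship hops from start to goal, or None."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return dist + 1
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, dist + 1))
    return None  # the two people are not connected


def intermediaries(graph, start, goal):
    """Number of friends standing between start and goal on a shortest path."""
    hops = hops_between(graph, start, goal)
    return None if hops is None else max(hops - 1, 0)


if __name__ == "__main__":
    # Hypothetical chain of six people: a - b - c - d - e - f
    friends = {
        "a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
        "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
    }
    print(intermediaries(friends, "a", "f"))
```

In this toy chain, "a" reaches "f" in five hops, which means four intermediary friends.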
  
	
  
Why?
• Facebook can improve the user experience
  – Make useful predictions about users' behavior
  – Make better guesses about which ads you might be more or less open to at any given time
• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship
How does Facebook handle Big Data?
• Facebook built its data storage system using open-source software called Hadoop.
  – Hadoop spreads the data across many machines inside a data center.
  – Hive, also open source, acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code.
• Much of Facebook's data resides in one Hadoop store more than 100 petabytes in size (one petabyte is a million gigabytes), says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."
eBay
• eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site.
  – 97 million active buyers and sellers
  – Over 200 million items for sale in 50,000 categories
  – The site handles close to 2 billion page views,
  – 250 million search queries and tens of billions of database calls daily.
• The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly.
• 100 eBay engineers are working on the Cassini project. The new engine is expected to respond to user queries with results that are context-based and more accurate than those provided by the current system.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
JPMorgan Chase
• JPMorgan Chase still relies heavily on relational database systems for transaction processing.
• Hadoop technology is used for a growing number of purposes, including fraud detection, IT risk management and self service.
  – Over 150 petabytes of data stored online, 30,000 databases and 3.5 billion log-ins to user accounts.
• Hadoop's ability to store vast volumes of unstructured data allows the company to collect and store Web logs, transaction data and social media data.
• The data is aggregated into a common platform for use in a range of customer-focused data mining and data analytics tools.
Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
Premier
• Premier is a U.S. healthcare alliance network with more than 2,700 member hospitals and health systems, 90,000 non-acute facilities and 400,000 physicians.
  – A large database of clinical, financial, patient, and supply chain data
  – Generated comprehensive and comparable clinical outcome measures, resource utilization reports and transaction-level cost data
• Big data is used to improve the healthcare processes at approximately 330 hospitals, saving an estimated 29,000 lives and reducing healthcare spending by nearly $7 billion.
Reference: IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. http://www03.ibm.com/industries/ca/en/healthcare/documents/Data_driven_healthcare_organizations_use_big_data_analytics_for_big_gains.pdf
Some Successes
• The Rizzoli Orthopedic Institute in Bologna, Italy
  – Uses advanced analytics to gain a more "granular understanding" of the clinical variations within families whereby individual patients display extreme differences in the severity of their symptoms.
• The insight is reported to have reduced annual hospitalizations by 30% and the number of imaging tests by 60%.
Social Media Analytics
• Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities.
What is social media analytics? - Definition from WhatIs.com
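As a rough illustration of what "mining customer sentiment" means, the sketch below labels posts with a tiny word-list approach. Everything in it is an assumption made for the example: the word lists, the sample posts and the scoring rule are hypothetical, and real social media analytics tools rely on much richer language models.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny sentiment lexicons.
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "poor", "slow", "terrible"}


def score(post):
    """Positive words add one point, negative words subtract one."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(post):
    s = score(post)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


def summarize(posts):
    """Count how many posts fall under each sentiment label."""
    return Counter(label(p) for p in posts)


if __name__ == "__main__":
    posts = [
        "I love the new shirt, great colour!",
        "Delivery was slow and support was terrible",
        "It arrived on Tuesday",
    ]
    print(summarize(posts))
```

A marketing team could track how these counts move after a product launch, which is the kind of decision support the definition above describes.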
  
Starting a Big Data Initiative
[Diagram: layers Data, Infrastructure, Big Data Tools, Analytics Software, Visualization; approaches Top Down and Bottom Up]
Data Product
• A data product provides actionable information without exposing the decision maker to the underlying data or analytics
  – Movie recommendations
  – Weather forecasts
  – Stock market prediction
  – Operational improvement
  – Health diagnosis
  – Targeted advertising
Source: The Field Guide to Data Science, Booz Allen Hamilton
Bottom-up approach
• What data do we have?
• How can we collect and store it?
• What infrastructure and tools are needed to process this big data?
• What analytics methods can be applied?
• What insight can we gain from this data and analysis?
Top-down approach
• What is the business challenge that can create value and impact for the organization?
• What data do we need?
• What tools and analytics approach should be used?
• What infrastructure is needed?
Some thoughts
• A bottom-up approach may be good when you do not know how to start
• Pick some easy question and start a pilot
  – Learn infrastructure technology, analytics technology and tools
  – Use data you already have
• A top-down approach that focuses on business value is better but challenging
  – It is hard to ask a good question; management needs to identify the need
  – You may have to ask many questions and pick the right one based on
    • Impact and value
Example: What is / is not a big data problem?
• I want to classify legal documents to make them easier to process
• I want to learn how our customers react to our new T-shirt
• I want to understand how our students use Facebook
Some Trends
Trend: Information Tsunami is coming!
Information Tsunami
• Rapid expansion of smartphone usage, social computing, mobile applications, gaming
• Rapid increases in network bandwidth and coverage
  – Wi-Fi, 4G
• Rapid move toward the Internet of Things (IoT)
  – Sensors everywhere, multimedia information
Trend: Big data infrastructure becomes even more powerful and easier to use
Big Data Infrastructure Goes to Cloud
• Data is already on the cloud
  – Virtual organization
  – Cloud-based SaaS services
• Big Data as a Service on the cloud
  – Private cloud
  – Public cloud
• IBM Bluemix, Amazon AWS (EMR) and many others
[Diagram labels: Big Data, Services, App]
Trend: Big data is moving toward real usage
Trends
• Big data is moving toward real usage
  – From pilot to real usage
• More software solutions
  – Infrastructure
  – Analytics
    • Statistical analysis
    • Social graph analysis with machine learning
• More unstructured data
  – Facebook, Twitter, text, video, image
[Diagram labels: Analytics, Structured, Unstructured, Big Data]
Trend: Much smarter data analytics is coming
Big Data Analytics
• A set of advanced technologies designed to work with large volumes of heterogeneous data
• Used to explore the data and to discover interrelationships and patterns using sophisticated quantitative methods such as
  – machine learning
  – neural networks
  – robotics algorithms
  – computational mathematics
  – artificial intelligence
Deep Learning
• Deep learning is a subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing.
  – Unsupervised learning of abstract concepts
Applying Deep Learning
• In 2011, Stanford computer science professor Andrew Ng founded Google's Google Brain project, which created a neural network trained with deep learning algorithms that famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos, and without ever having been told what a "cat" is.
• Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
• Voice recognition such as Google Now and Apple's Siri now uses deep learning.
  – According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, stands at 25% lower than previous versions of the software.
Source: http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning
http://www.wired.com/2014/08/deep-learning-yann-lecun/
IBM Watson and Cognitive Technology
• Watson is a cognitive technology that processes information more like a human than a computer: by understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does.
• Watson "gets smarter" in three ways:
  – being taught by its users
  – learning from prior interactions
  – being presented with new information
• This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.
Applying Watson in Healthcare
• WellPoint, Inc. is an Indianapolis-based health benefits company.
  – Approximately 37 million health plan members
  – Processes more than 550 million claims per year
• Using IBM Watson™ to improve the quality and efficiency of healthcare decisions
  – WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management. Natural language processing leverages unstructured data, such as text-based treatment requests.
• Benefits
  – Helps utilization management (UM) nurses make faster UM decisions about treatment requests
  – Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive
  – Includes unstructured data in the streamlined decision process
Challenges
•  Developing Big Data applications is not simple
–  New algorithms and new software development tools are needed
•  Proper policies on data security and ownership are needed
•  Lack of data scientists
–  A different role from software developer
Have fun with your Big Data Adventure!

More Related Content

What's hot

Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
Suman Saurabh
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
Sivashankar Ganapathy
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Karan Desai
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Kristof Jozsa
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
boorad
 
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Edureka!
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
Putchong Uthayopas
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentationAASTHA PANDEY
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Nishant Gandhi
 
Big Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning GuruBig Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning Guru
KCC Software Ltd. & Easylearning.guru
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Tyrone Systems
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Sreedhar Chowdam
 
Big data – a brief overview
Big data – a brief overviewBig data – a brief overview
Big data – a brief overviewDorai Thodla
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
Information Security Awareness Group
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
Matthew Dennis
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Joey Li
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
nabati
 
BDaas- BigData as a service
BDaas- BigData as a service  BDaas- BigData as a service
BDaas- BigData as a service
Agile Testing Alliance
 

What's hot (20)

Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
 
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
Big Data vs Data Science vs Data Analytics | Demystifying The Difference | Ed...
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Big Data Analytics MIS presentation
Big Data Analytics MIS presentationBig Data Analytics MIS presentation
Big Data Analytics MIS presentation
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning GuruBig Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning Guru
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data – a brief overview
Big data – a brief overviewBig data – a brief overview
Big data – a brief overview
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
BDaas- BigData as a service
BDaas- BigData as a service  BDaas- BigData as a service
BDaas- BigData as a service
 

Viewers also liked

Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1
IMC Institute
 
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRBig Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
IMC Institute
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015
IMC Institute
 
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลCloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลIMC Institute
 
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SMEการบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
IMC Institute
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud Platform
IMC Institute
 
Big Data Analytics Using Hadoop Cluster On Amazon EMR
Big Data Analytics Using Hadoop Cluster  On Amazon EMRBig Data Analytics Using Hadoop Cluster  On Amazon EMR
Big Data Analytics Using Hadoop Cluster On Amazon EMR
IMC Institute
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
IMC Institute
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
IMC Institute
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
IMC Institute
 
Thailand ICT Review 2014
Thailand ICT Review 2014Thailand ICT Review 2014
Thailand ICT Review 2014
IMC Institute
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
IMC Institute
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a Service
IMC Institute
 
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveAnalyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
IMC Institute
 
Mobile User and App Analytics in China
Mobile User and App Analytics in ChinaMobile User and App Analytics in China
Mobile User and App Analytics in China
IMC Institute
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
IMC Institute
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
IMC Institute
 
Big data project management
Big data project managementBig data project management
Big data project management
IMC Institute
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015
IMC Institute
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera Quickstart
IMC Institute
 

Viewers also liked (20)

Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1
 
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRBig Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015
 
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัลCloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
Cloud Computing สำหรับ ผู้บริหารเพื่อรองรับเศรษฐกิจดิจิทัล
 
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SMEการบริหารจัดการระบบ  Cloud Computing  สำหรับองค์กรธุรกิจ SME
การบริหารจัดการระบบ Cloud Computing สำหรับองค์กรธุรกิจ SME
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud Platform
 
Big Data Analytics Using Hadoop Cluster On Amazon EMR
Big Data Analytics Using Hadoop Cluster  On Amazon EMRBig Data Analytics Using Hadoop Cluster  On Amazon EMR
Big Data Analytics Using Hadoop Cluster On Amazon EMR
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
 
Thailand ICT Review 2014
Thailand ICT Review 2014Thailand ICT Review 2014
Thailand ICT Review 2014
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
 
Big Data as a Service
Big Data as a ServiceBig Data as a Service
Big Data as a Service
 
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and HiveAnalyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive
 
Mobile User and App Analytics in China
Mobile User and App Analytics in ChinaMobile User and App Analytics in China
Mobile User and App Analytics in China
 
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
Cloud Computing in Thailand Readiness Survey 2015 & IT Trends Prediction 2016
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 
Big data project management
Big data project managementBig data project management
Big data project management
 
Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015Thai Software & Software Market Survey 2015
Thai Software & Software Market Survey 2015
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera Quickstart
 

Similar to Introduction to Big Data

Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
Dr.K.Sreenivas Rao
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
Abhishek Roy
 
Big Data
Big DataBig Data
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Sri Kanth
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
Bob Hardaway
 
Big Data
Big DataBig Data
Big Data
Kirubaburi R
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
Vishwajeet Jadeja
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
Nitesh Ghosh
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG Data
Prakalp Agarwal
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
almaraniabwmalk
 
TSE_Pres12.pptx
TSE_Pres12.pptxTSE_Pres12.pptx
TSE_Pres12.pptx
ssuseracaaae2
 
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptxUnit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
YashiBatra1
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
nallagangus
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigManish Chopra
 
Big data
Big dataBig data
Big data
Pietro Nardone
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
6535ANURAGANURAG
 
Big Data
Big DataBig Data
Big Data
Neha Mehta
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
Tomy Rhymond
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
DignitasDigital1
 

Similar to Introduction to Big Data (20)

Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Big Data
Big DataBig Data
Big Data
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Big Data
Big DataBig Data
Big Data
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG Data
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
TSE_Pres12.pptx
TSE_Pres12.pptxTSE_Pres12.pptx
TSE_Pres12.pptx
 
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptxUnit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Big-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-KoenigBig-Data-Seminar-6-Aug-2014-Koenig
Big-Data-Seminar-6-Aug-2014-Koenig
 
Big data
Big dataBig data
Big data
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Big Data
Big DataBig Data
Big Data
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 

More from IMC Institute

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14
IMC Institute
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019
IMC Institute
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AI
IMC Institute
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12
IMC Institute
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
IMC Institute
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to Work
IMC Institute
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
IMC Institute
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11
IMC Institute
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
IMC Institute
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon Valley
IMC Institute
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
IMC Institute
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
IMC Institute
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)
IMC Institute
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
IMC Institute
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9
IMC Institute
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016
IMC Institute
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger
IMC Institute
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.org
IMC Institute
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.org
IMC Institute
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
IMC Institute
 

More from IMC Institute (20)

นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14นิตยสาร Digital Trends ฉบับที่ 14
นิตยสาร Digital Trends ฉบับที่ 14
 
Digital trends Vol 4 No. 13 Sep-Dec 2019
Digital trends Vol 4 No. 13  Sep-Dec 2019Digital trends Vol 4 No. 13  Sep-Dec 2019
Digital trends Vol 4 No. 13 Sep-Dec 2019
 
บทความ The evolution of AI
บทความ The evolution of AIบทความ The evolution of AI
บทความ The evolution of AI
 
IT Trends eMagazine Vol 4. No.12
IT Trends eMagazine  Vol 4. No.12IT Trends eMagazine  Vol 4. No.12
IT Trends eMagazine Vol 4. No.12
 
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformationเพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
เพราะเหตุใด Digitization ไม่ตอบโจทย์ Digital Transformation
 
IT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to WorkIT Trends 2019: Putting Digital Transformation to Work
IT Trends 2019: Putting Digital Transformation to Work
 
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรมมูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
มูลค่าตลาดดิจิทัลไทย 3 อุตสาหกรรม
 
IT Trends eMagazine Vol 4. No.11
IT Trends eMagazine  Vol 4. No.11IT Trends eMagazine  Vol 4. No.11
IT Trends eMagazine Vol 4. No.11
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
บทความ The New Silicon Valley
บทความ The New Silicon Valleyบทความ The New Silicon Valley
บทความ The New Silicon Valley
 
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10นิตยสาร IT Trends ของ  IMC Institute  ฉบับที่ 10
นิตยสาร IT Trends ของ IMC Institute ฉบับที่ 10
 
แนวทางการทำ Digital transformation
แนวทางการทำ Digital transformationแนวทางการทำ Digital transformation
แนวทางการทำ Digital transformation
 
The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)The Power of Big Data for a new economy (Sample)
The Power of Big Data for a new economy (Sample)
 
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
บทความ Robotics แนวโน้มใหม่สู่บริการเฉพาะทาง
 
IT Trends eMagazine Vol 3. No.9
IT Trends eMagazine  Vol 3. No.9 IT Trends eMagazine  Vol 3. No.9
IT Trends eMagazine Vol 3. No.9
 
Thailand software & software market survey 2016
Thailand software & software market survey 2016Thailand software & software market survey 2016
Thailand software & software market survey 2016
 
Developing Business Blockchain Applications on Hyperledger
Developing Business  Blockchain Applications on Hyperledger Developing Business  Blockchain Applications on Hyperledger
Developing Business Blockchain Applications on Hyperledger
 
Digital transformation @thanachart.org
Digital transformation @thanachart.orgDigital transformation @thanachart.org
Digital transformation @thanachart.org
 
บทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.orgบทความ Big Data จากบล็อก thanachart.org
บทความ Big Data จากบล็อก thanachart.org
 
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformationกลยุทธ์ 5 ด้านกับการทำ Digital Transformation
กลยุทธ์ 5 ด้านกับการทำ Digital Transformation
 

Recently uploaded

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 


Introduction to Big Data

  • 1. Introduction to Big Data Dr. Putchong Uthayopas Department of Computer Engineering, Faculty of Engineering, Kasetsart University Email: putchong@ku.th
  • 2. We are living in the world of data: geophysical exploration, medical imaging, video surveillance, mobile sensors, gene sequencing, smart grids, social media
  • 3.
  • 4.
  • 5.
  • 6.
  • 7. “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner Inc.)
  • 8.
  • 9.
  • 10. Why Big Data?
      • Improve products and services
      • Increase customer satisfaction/behavior
      • Improve operational efficiency
      • Understand emerging market trends
      “The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence.”
      Source: http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf
      “Know thyself, know thy enemy. A thousand battles, a thousand victories.”
  • 11. Source: The Field Guide to Data Science
  • 12. Big Data vs. Business Intelligence vs. Analytics
      • BI software and technology
          – Well-structured data from a warehouse
          – Visual representation of data to gain insight into the data
          – Some predictive capability, such as statistical analysis and data mining
      • Big Data
          – Focuses on analysis of huge, unstructured data sets to gain insight automatically
  • 13. Properties of Big Data: Volume, Velocity, Variety
  • 14. Volume
      • Big data must be huge
          – Beyond the capability of a single computer server to process
          – Possible to store the data but difficult to process it
  • 15. Velocity
      • Big data accumulates at a very fast speed
          – Stock market data
          – Internet access logs
          – Social media data (Twitter, Facebook, IG)
      • We need to
          – Extract meaning as fast and as much as we can before throwing away the data
  • 16. Variety
      • Data comes in many varieties
          – Traditional databases
          – Documents
          – Web pages
          – Social media data
          – Images
          – Video/Audio
          – Location
  • 17. Diya Soubra, The 3Vs that define Big Data, 2012. http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
  • 18. Considerations for Applying Big Data. http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/
  • 19. BRIEF OVERVIEW OF BIG DATA TOOLS
  • 20. Big Data Ecosystem. Reference: http://dataconomy.com/understanding-big-data-ecosystem/
  • 21. Big Data Ecosystem: Infrastructure
      • Hadoop
          – Technologies designed for storing, processing and analysing data by breaking it up and distributing it into parts, then analysing those parts concurrently rather than tackling one monolithic block of data all in one go.
      • NoSQL
          – Stands for Not Only SQL
          – Used for processing large volumes of multi-structured data. Most NoSQL databases are most adept at handling discrete data stored among multi-structured data.
      • Massively Parallel Processing (MPP) Databases
          – MPP databases segment data across multiple nodes, process those segments in parallel, and use SQL.
      Reference: http://dataconomy.com/understanding-big-data-ecosystem/
  • 22. Big Data Ecosystem: Analytics
      • Analytics Platforms
          – Integrate and analyse data to uncover new insights and help companies make better-informed decisions.
      • Visualization Platforms
          – Visualize data: take the raw data and present it in complex, multi-dimensional visual formats to illuminate the information.
      • Business Intelligence (BI) Platforms
          – Analyze data from multiple sources to deliver services such as business intelligence reports, dashboards and visualizations.
      • Machine Learning
          – The algorithm “learns from” data, and the output depends on the use case. One of the most famous examples is IBM’s supercomputer Watson, which has “learned” to scan vast amounts of information to find specific answers, and can comb through 200 million pages of structured and unstructured data in minutes.
      Reference: http://dataconomy.com/understanding-big-data-ecosystem/
  • 23. How can we store and process massive data?
      • Beyond the capability of a single server
      • Basic infrastructure
          – Cluster of servers
          – High-speed interconnect
          – High-speed storage cluster
      • Incoming data is spread across the server farm
      • Processing is quickly distributed to the farm
      • Results are collected and sent back
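The spread / process / collect flow on this slide can be sketched in a few lines of Python. This is a minimal single-machine illustration, not how Hadoop or any MPP database is actually implemented: worker threads stand in for servers, and `NUM_NODES`, `partition`, `process_segment`, and the sample log lines are all hypothetical names and data chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_NODES = 4  # hypothetical cluster size; threads stand in for servers


def partition(records, num_nodes):
    """Spread incoming records across nodes using a stable hash of each record."""
    segments = [[] for _ in range(num_nodes)]
    for record in records:
        node = zlib.crc32(record.encode("utf-8")) % num_nodes
        segments[node].append(record)
    return segments


def process_segment(segment):
    """Work done locally on one node: here, the total character count."""
    return sum(len(record) for record in segment)


def scatter_gather(records, num_nodes=NUM_NODES):
    segments = partition(records, num_nodes)  # spread the data
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_segment, segments))  # process in parallel
    return sum(partials)  # collect the partial results


log_lines = ["GET /index.html", "GET /about.html", "POST /login", "GET /index.html"]
total_chars = scatter_gather(log_lines)
```

Each node only ever sees its own segment, which is the property that lets a real cluster add servers to handle more data.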
  • 24. NoSQL (Not Only SQL)
      • A NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
          – Typically non-relational, distributed, open-source and horizontally scalable
          – Used to handle a huge amount of data
          – The original intention was modern web-scale databases
      Reference: http://nosql-database.org/
  • 25. MongoDB
      • MongoDB is a general-purpose, open-source database.
      • MongoDB features:
          – Document data model with dynamic schemas
          – Full, flexible index support and rich queries
          – Auto-sharding for horizontal scalability
          – Built-in replication for high availability
          – Text search
          – Advanced security
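To make "document data model with dynamic schemas" concrete, the sketch below uses plain Python dictionaries instead of a real MongoDB client, so it runs without a server. The `find` helper and the sample documents are hypothetical; they only mimic the idea of querying documents by field equality and are not MongoDB's API.

```python
# Documents in one collection need not share the same fields.
products = [
    {"_id": 1, "name": "laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "t-shirt", "price": 15, "sizes": ["S", "M", "L"]},
    {"_id": 3, "name": "e-book", "price": 15},
]


def find(collection, query):
    """Return documents whose fields equal every key/value pair in the query."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


cheap_items = find(products, {"price": 15})
names = [doc["name"] for doc in cheap_items]
```

No schema change is needed to add the `sizes` field to one document only, which is the contrast with a relational table that the slide is pointing at.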
  • 26. Hadoop
      • Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
      • The base Apache Hadoop framework is composed of the following modules:
          – Hadoop Common: contains libraries and utilities needed by other Hadoop modules;
          – Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
          – Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications; and
          – Hadoop MapReduce: a programming model for large-scale data processing.
      • Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
  • 27. Magic behind Hadoop and HDFS
      • The problem is divided into two phases
          – Map: apply some action to data in <key, value> pairs and get intermediate results
          – Reduce: summarize the intermediate <key, value> results and return them to the main program
      Ricky Ho, How Hadoop Map/Reduce works, http://architects.dzone.com/articles/how-hadoop-mapreduce-works
  • 28. Example: Word count
      • Counting words in an input text file
          – How many times does the word “love” appear in a novel? ^_^
      • In the map phase, the sentence is split into words, forming the initial key-value pairs <word, 1>
          – “tring tring the phone rings” becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>
      • In the reduce phase, the keys are grouped together and the values for identical keys are added
          – There is only one pair of identical keys, ‘tring’; the values for these keys are added, so the output key-value pairs are
          – <tring,2>, <the,1>, <phone,1>, <rings,1>
      • Reduce forms an aggregation phase for keys
          – This gives the number of occurrences of each word in the input
      http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
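The word-count walk-through above can be reproduced in plain Python. This is a sketch of the programming model only, run in a single process; it does not use the Hadoop API, and the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative choices (on a real cluster the framework performs the grouping step between map and reduce).

```python
from collections import defaultdict


def map_phase(line):
    """Emit an initial <word, 1> pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Add up the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


pairs = map_phase("tring tring the phone rings")
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because each map call looks at one line and each reduce call at one key, both phases can be run on many machines at once without coordination.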
  • 29. In-memory Database
      • An in-memory database is
          – a database management system that primarily relies on main memory for computer data storage
          – faster than disk-optimized databases, since the internal optimization algorithms are simpler and execute fewer CPU instructions
      • Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
      Source: http://en.wikipedia.org/wiki/In-memory_database
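A quick way to try the idea is SQLite's in-memory mode from Python's standard library. This is only a small embedded illustration, not one of the large-scale in-memory systems the slide has in mind, and the `access_log` table and its rows are invented for the example.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM (nothing is written to disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (visitor TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("ann", "/home"), ("bob", "/home"), ("ann", "/cart")],
)

# Count page views per page; the query never touches a disk file.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM access_log GROUP BY page ORDER BY page"
).fetchall()
conn.close()
```

The trade-off is that the data disappears when the connection closes, which is why production in-memory databases add replication or snapshots.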
  • 30.
  • 31. What is Spark?
      • A fast and expressive cluster computing engine compatible with Apache Hadoop
      • Efficient
          – General execution graphs
          – In-memory storage
          – Up to 10× faster on disk, 100× in memory
      • Usable
          – Rich APIs in Java, Scala, Python
          – Interactive shell
          – 2-5× less code
  • 33. Spark at Yahoo
      • Two use cases: personalizing news pages for Web visitors, and running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.
          – Wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)
          – With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
      • The second use case shows off the interactive capability of Hive on Spark (Shark).
          – Use existing BI tools to view and query the advertising analytics data collected in Hadoop.
      http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
  • 34. Big Data Goes to Cloud
      • Data is already on the cloud
          – Virtual organization
          – Cloud-based SaaS service
      • Big Data as a Service on the cloud
          – Private cloud
          – Public cloud
  • 35. Amazon
      • Amazon EC2
          – Computation service using VMs
      • Amazon DynamoDB
          – Large, scalable NoSQL database
          – Fully distributed, shared-nothing architecture
      • Amazon Elastic MapReduce (Amazon EMR)
          – Hadoop-based analysis engine
          – Can be used to analyse big data without the need to build the infrastructure
      http://aws.amazon.com/big-data/
  • 36. Google Cloud Platform
      • App Engine
          – Mobile and web apps
      • Cloud SQL
          – MySQL on the cloud
      • Cloud Storage
          – Data storage
      • BigQuery
          – Data analysis
      • Google Compute Engine
          – Processing of large data
  • 37.
  • 38. BIG DATA BENEFITS AND USE CASES
  • 39. Current Trends
      • Big data is moving toward real usage
          – From pilot to real usage
      • More software solutions
          – Infrastructure
          – Analytics: statistical analysis, social graph analysis
      • More unstructured data
          – Facebook, Twitter, text, video, images
      (Diagram labels: Analytics, Structured, Unstructured, Big Data)
  • 40. Google Flu
      • A pattern emerges when all the flu-related search queries are added together.
      • We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
      • By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
      http://www.google.org/flutrends/about/how.html
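The comparison between query counts and surveillance data described above boils down to measuring how closely two time series move together. The sketch below computes a Pearson correlation by hand; it is not Google's actual method, and the weekly numbers are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented weekly figures: flu-related search counts vs. reported cases.
query_counts = [120, 150, 310, 480, 450, 200]
reported_cases = [10, 14, 30, 49, 44, 18]

r = pearson(query_counts, reported_cases)
print(f"correlation: {r:.3f}")
```

A value of `r` close to 1 means the search counts rise and fall with reported cases, which is what would justify using the counts as an estimate.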
  • 41. WHAT FACEBOOK KNOWS
      • Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.
      http://www.facebook.com/data
  • 42. Study of Human Society
      • Facebook, in collaboration with the University of Milan, conducted an experiment that involved
          – the entire social network as of May 2011
          – more than 10 percent of the world's population
      • Analyzing the 69 billion friend connections among those 721 million people showed that
          – four intermediary friends are usually enough to introduce anyone to a random stranger
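Counting intermediary friends is a shortest-path question on the friendship graph. The sketch below uses breadth-first search on a tiny invented graph; it illustrates the measurement only and is not the method used in the Facebook study, which had to handle billions of edges.

```python
from collections import deque


def degrees_of_separation(friends, start, target):
    """Breadth-first search: number of friendship hops from start to target."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == target:
                return hops + 1
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, hops + 1))
    return None  # no chain of friends connects the two people


# Tiny invented friendship graph (undirected, listed in both directions).
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": [],
}

hops = degrees_of_separation(friends, "ann", "dan")
intermediaries = hops - 1  # friends strictly between the two endpoints
```

Here `ann` reaches `dan` through `bob` and `cat`, so there are two intermediaries; the study's finding corresponds to this number usually being four or fewer.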
  • 43. Why?
      • Facebook can improve the user experience
          – Make useful predictions about users' behavior
          – Make better guesses about which ads you might be more or less open to at any given time
      • Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.
  • 44. How does Facebook handle Big Data?
      • Facebook built its data storage system using open-source software called Hadoop.
          – Hadoop spreads the data across many machines inside a data center.
          – Hive, an open-source tool, acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code.
      • Much of Facebook's data resides in one Hadoop store more than 100 petabytes in size (a petabyte is a million gigabytes), says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."
  • 45. eBay
      • eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site.
          – 97 million active buyers and sellers
          – Over 200 million items for sale in 50,000 categories
          – The site handles close to 2 billion page views, 250 million search queries and tens of billions of database calls daily.
      • The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly.
      • 100 eBay engineers are working on the Cassini project. The new engine is expected to respond to user queries with results that are context-based and more accurate than those provided by the current system.
      Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
  • 46. JPMorgan Chase
      • JPMorgan Chase still relies heavily on relational database systems for transaction processing.
      • Hadoop technology is used for a growing number of purposes, including fraud detection, IT risk management and self service.
          – Over 150 petabytes of data stored online, 30,000 databases and 3.5 billion log-ins to user accounts
      • Hadoop's ability to store vast volumes of unstructured data allows the company to collect and store Web logs, transaction data and social media data.
      • The data is aggregated into a common platform for use in a range of customer-focused data mining and data analytics tools.
      Source: http://www.computerworld.com/article/2550078/data-center/hadoop-is-ready-for-the-enterprise--it-execs-say.html
  • 47. Premier
      • Premier is a U.S. healthcare alliance network with more than 2,700 members (hospitals and health systems), 90,000 non-acute facilities and 400,000 physicians.
          – A large database of clinical, financial, patient, and supply chain data
          – Generated comprehensive and comparable clinical outcome measures, resource utilization reports and transaction-level cost data
      • Big data is used to improve healthcare processes at approximately 330 hospitals, saving an estimated 29,000 lives and reducing healthcare spending by nearly $7 billion.
      Reference: IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. http://www03.ibm.com/industries/ca/en/healthcare/documents/Data_driven_healthcare_organizations_use_big_data_analytics_for_big_gains.pdf
  • 48. Some Successes
      • The Rizzoli Orthopedic Institute in Bologna, Italy
          – Uses advanced analytics to gain a more “granular understanding” of the clinical variations within families, whereby individual patients display extreme differences in the severity of their symptoms.
      • The insight is reported to have reduced annual hospitalizations by 30% and the number of imaging tests by 60%.
  • 49. Social Media Analytics
      • Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities.
      Source: What is social media analytics? Definition from WhatIs.com
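Mining customer sentiment can be illustrated with a very small lexicon-based scorer. This is a deliberately naive sketch: the word lists and sample posts are hypothetical, and real social media analytics tools rely on much richer language models.

```python
POSITIVE = {"love", "great", "good", "happy"}  # hypothetical word lists
NEGATIVE = {"hate", "bad", "poor", "broken"}


def sentiment_score(post):
    """Positive-word count minus negative-word count for one post."""
    words = [w.strip(".,!?").lower() for w in post.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


posts = [
    "I love the new shirt, great fabric!",
    "Delivery was bad and the zipper is broken.",
    "It arrived on Tuesday.",
]
scores = [sentiment_score(p) for p in posts]
```

A score above zero suggests a favorable post, below zero an unfavorable one, and zero a neutral or undetected one; aggregated over many posts this gives a rough signal for marketing and customer service teams.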
  • 50.
  • 51.
  • 52. Starting a Big Data Initiative
      (Diagram layers: Data, Infrastructure, Big Data Tools, Analytics Software, Visualization; approaches: Top Down, Bottom Up)
  • 53. Data Product
      • A data product provides actionable information without exposing the decision maker to the underlying data or analytics
          – Movie recommendations
          – Weather forecasts
          – Stock market prediction
          – Operational improvement
          – Health diagnosis
          – Targeted advertising
  • 54. Source: The Field Guide to Data Science, Booz Allen Hamilton
  • 55. Bottom-up approach
      • What data do we have?
      • How can we collect and store it?
      • What infrastructure and tools are needed to process this big data?
      • What analytics methods can be applied?
      • What insight can we gain from this data and analysis?
  • 56. Top-down approach
      • What business challenge can create value and impact for the organization?
      • What data do we need?
      • What tools and analytics approach should be used?
      • What infrastructure is needed?
  • 57. Some thoughts
      • The bottom-up approach may be good when you do not know how to start
      • Pick an easy question and start a pilot
          – Learn the infrastructure technology, analytics technology and tools
          – Use data you already have
      • A top-down approach that focuses on business value is better but challenging
          – It is hard to ask a good question; management is needed to identify the need
          – You may have to ask many questions and pick the right one based on impact and value
  • 58. Example: What is/is not a big data problem?
      • I want to classify legal documents to make them easier to process
      • I want to learn how our customers react to our new T-shirt
      • I want to understand how our students use Facebook
  • 59.
  • 60.
  • 62. Trend: Information Tsunami is coming!
  • 63. Information Tsunami
      • Rapid expansion of smartphone usage, social computing, mobile applications, gaming
      • Rapid increases in network bandwidth and coverage
          – Wi-Fi, 4G
      • Rapid move toward the Internet of Things (IoT)
          – Sensors everywhere, multimedia information
  • 64. Trend: Big data infrastructure becomes even more powerful and easy to use
  • 65. In-memory Database
      • An in-memory database is
          – a database management system that primarily relies on main memory for computer data storage
          – faster than disk-optimized databases, since the internal optimization algorithms are simpler and execute fewer CPU instructions
      • Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
      Source: http://en.wikipedia.org/wiki/In-memory_database
  • 66.
  • 67. What is Spark?
      • A fast and expressive cluster computing engine compatible with Apache Hadoop
      • Efficient
          – General execution graphs
          – In-memory storage
          – Up to 10× faster on disk, 100× in memory
      • Usable
          – Rich APIs in Java, Scala, Python
          – Interactive shell
          – 2-5× less code
  • 68. Spark at Yahoo
      • Two use cases: personalizing news pages for Web visitors, and running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.
          – Wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)
          – With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
      • The second use case shows off the interactive capability of Hive on Spark (Shark).
          – Use existing BI tools to view and query the advertising analytics data collected in Hadoop.
      http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
  • 69. Big Data Infrastructure Goes to Cloud
      • Data is already on the cloud
          – Virtual organization
          – Cloud-based SaaS service
      • Big Data as a Service on the cloud
          – Private cloud
          – Public cloud
      • IBM Bluemix, Amazon AWS (EMR) and many others
      (Diagram labels: Big Data Services, Services, Apps)
  • 70. Trend: Big data is moving toward real usage
  • 71. Trends
      • Big data is moving toward real usage
          – From pilot to real usage
      • More software solutions
          – Infrastructure
          – Analytics: statistical analysis, social graph analysis with machine learning
      • More unstructured data
          – Facebook, Twitter, text, video, images
      (Diagram labels: Analytics, Structured, Unstructured, Big Data)
  • 72. Trend: Much smarter data analytics is coming
  • 73. Big Data Analytics
      • A set of advanced technologies designed to work with large volumes of heterogeneous data
      • Explores the data to discover interrelationships and patterns using sophisticated quantitative methods such as
          – machine learning
          – neural networks
          – robotics algorithms
          – computational mathematics
          – artificial intelligence
  • 74. Deep Learning
      • Deep learning is a subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing.
          – Unsupervised learning of abstract concepts
  • 75. Applying Deep Learning
      • In 2011, Stanford computer science professor Andrew Ng founded the Google Brain project, which created a neural network trained with deep learning algorithms. It famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos and without ever having been told what a “cat” is.
      • Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
      • Voice recognition systems such as Google Now and Apple’s Siri now use deep learning.
          – According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, is 25% lower than in previous versions of the software.
      Source: http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning
      http://www.wired.com/2014/08/deep-learning-yann-lecun/
  • 76. IBM Watson and Cognitive Technology
      • Watson is a cognitive technology that processes information more like a human than a computer: by understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does.
      • Watson “gets smarter” in three ways:
          – being taught by its users
          – learning from prior interactions
          – being presented with new information
      • This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.
  • 77. Applying Watson in Healthcare
      • WellPoint, Inc. is an Indianapolis-based health benefits company.
          – Approximately 37 million health plan members
          – Processes more than 550 million claims per year
      • Using IBM Watson™ to improve the quality and efficiency of healthcare decisions
          – WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management (UM). Natural language processing leverages unstructured data, such as text-based treatment requests.
      • Benefits
          – Helps UM nurses make faster UM decisions about treatment requests
          – Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive
          – Includes unstructured data in the streamlined decision process
  • 78. Challenges
      • Developing big data applications is not simple
          – New algorithms, new software development tools
      • Proper policy about data security and ownership
      • Lack of data scientists
          – Different from software developers
  • 79. Have fun with your Big Data Adventure!