 



                       	
  
Applying Big Data Analytics.
  

Analyzing Multi-Structured
    Data with Hadoop
                  Justin Borgman
                 CEO & Co-Founder
Company Profile

•  30 people, based in Cambridge, MA
•  Founded in July 2010
•  Raised $9.5M Series A from Bessemer and Norwest
•  Based on the HadoopDB research project in the Yale Computer Science Department by Daniel Abadi, et al.

•  CEO & Co-Founder
•  Previously spent 7 years as a software developer at MIT Lincoln Laboratory and product manager at startup Covectra
•  Undergrad: UMass Amherst
•  Grad: Yale University

Big Data: Volume | Variety | Velocity | VALUE

Source: wikibon.org

Big Data in the Headlines

“Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year

“How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”

“Why Netflix produces BBC remake starring Kevin Spacey, directed by David Fincher”

The Big Data Ecosystem

Example: Big Data Analysis Process

[Diagram: Raw Data → load → HADOOP → extract (aggregate, sample, filter) → MPP DBMS → predict; outputs to Applications, BI Tools, and Predictive analytics for the Business Analyst]
Raw data: web access logs, click logs, impressions, email, tweets, sensor data, documents
Hadoop processing: term extraction, entity extraction, sentiment analysis, geocoding, cleanse, sessionization, join

Example: Hadapt Analysis Process

[Diagram: Raw Data → load → predict; outputs to Applications, BI Tools, and Predictive analytics]

The Evolution of Analytics – Where are we today?

The early stages of analytics
•  Market Basket Analysis
•  Trend Analysis
•  Cyclical Analysis
•  Customer Segmentation

New Analytical Models
•  Pattern Detection, Discovery, Matching
•  A/B Testing and Behavioral Analysis
•  Sessionization
•  Social Correlation Analysis
•  Fractional Attribution
•  Sentiment Analysis
•  Personalization

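Sessionization, one of the newer models listed above, groups each user's raw click events into visits. A minimal Python sketch, assuming events arrive as (user_id, timestamp-in-seconds) pairs and that 30 minutes of inactivity ends a session; the record layout, the timeout, and the `sessionize` helper are illustrative assumptions, not something the deck specifies:

```python
from collections import defaultdict

SESSION_TIMEOUT_S = 30 * 60  # assumed inactivity gap that ends a session


def sessionize(events, timeout_s=SESSION_TIMEOUT_S):
    """Group (user_id, timestamp) events into per-user sessions.

    Returns {user_id: [[t1, t2, ...], [t3, ...], ...]} where each inner
    list is one session of timestamps in ascending order.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # Start a new session when the gap since the previous event
            # exceeds the timeout; otherwise extend the current session.
            if ts - user_sessions[-1][-1] > timeout_s:
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical click events: (user, seconds since some start time).
clicks = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000)]
result = sessionize(clicks)
```

Here the third "alice" event falls more than 30 minutes after the second, so it opens a new session.
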
Big Data in Action

•  Amazon and Netflix engage in arbitrage on video content based on customer behavior

•  Harvard predicts the spread of cholera in Haiti, and Derwent Capital out-trades the market based on tweets and their sentiment

•  Entire ecosystems were shotgun gene sequenced by Celera.

•  Life events are predicted by Target and marketed accordingly

•  *Osco Drug increased sales by optimizing product placement, e.g. beer and diapers

•  Ads are optimally placed and priced *for you* by DataXu in real time

•  Next Big Sound predicts new artists and hits based on signals from social media

•  Real-time production optimization saves Chevron over $1B/year

•  Retailer web sites are re-organized and re-optimized for content by Bloomreach

•  LinkedIn suggests who you might know, eHarmony suggests who you might love

Example: POS Data Insights

Example: e-Tailer

Business Opportunity
•  Should I run a promotion among the Lady Gaga fans or Justin Bieber fans?
•  Based on shopping cart and browsing/purchase history, what other products should be recommended before the customer checks out?
•  Which items are often purchased together, and any correlation with shopping date/time, customer age, gender, etc.?

Challenges
•  Diverse data sources
•  In-depth analytics (e.g. predictive modeling)
•  Real time performance at scale

Solution
    –  Integrate Hadoop with RDBMS
    –  Develop and integrate analytic libraries
    –  Make analytic jobs interactive (not batch oriented)

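The "which items are often purchased together" question above comes down to counting how often each pair of items shares a basket. A small Python sketch, assuming every order is just a list of item names; the sample orders and the `top_pairs` helper are made up for illustration:

```python
from collections import Counter
from itertools import combinations


def top_pairs(orders, n=3):
    """Count how many orders contain each unordered pair of items."""
    pair_counts = Counter()
    for items in orders:
        # Deduplicate and sort so ("beer", "diapers") and
        # ("diapers", "beer") are counted as the same pair.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)


# Hypothetical shopping carts.
orders = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["chips", "salsa"],
    ["beer", "diapers", "salsa"],
]
pairs = top_pairs(orders, n=1)
```

Correlating these pairs with date/time or customer attributes would mean adding those fields to the counting key, which this sketch leaves out.
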
Example: Customer Behavior Analysis

Business Opportunity
•  Analyze customer behavior to increase loyalty and trust, allocate advertising spend, optimize product incentives, identify fraud, micro-segment customer base.

Challenges
•  Full website session-level data needed, typically from raw web logs
•  Requires complex multi-pass SQL queries or new Non-SQL techniques
•  Requires rewriting query to change number of clicks analyzed

Hadapt Value
•  Performance: Single pass over data regardless of number of clicks analyzed
•  Ease of Dev & Ease of Manageability: Much simpler code
•  Ease of Use: Pattern flexibility to handle varied numbers of clicks and click patterns without requiring any code rewrite

Golden Path Analysis: Comparative Performance
ETL + RDBMS & SQL = 200 minutes
Hadoop + RDBMS = 135 minutes
Hadapt = 11 minutes

Example Analytic Questions
•  Which life events are strong opportunities for me to better engage my customers?
•  When am I about to lose a customer?
•  What are my top segments?
•  Which ad campaigns produced the most lift?
•  What products can I bundle to increase sales?
•  Are my online offers cannibalizing my in-store sales?
•  What models are my customers following so I can better predict their next move?

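The "single pass regardless of number of clicks analyzed" point can be illustrated with a path count whose length is a parameter instead of being baked into the query. This is only a Python sketch of the idea and not Hadapt's implementation; the session layout, the `checkout` goal page, and the `top_paths` helper are assumptions:

```python
from collections import Counter


def top_paths(sessions, goal, path_len, n=3):
    """Count the `path_len` pages visited immediately before `goal`.

    `sessions` is an iterable of page-name lists, one per session.
    Changing `path_len` needs no code rewrite, and each session is
    scanned once.
    """
    counts = Counter()
    for pages in sessions:
        for i, page in enumerate(pages):
            # Only count goal hits that have a full path in front of them.
            if page == goal and i >= path_len:
                counts[tuple(pages[i - path_len:i])] += 1
    return counts.most_common(n)


# Hypothetical sessions, each an ordered list of pages viewed.
sessions = [
    ["home", "search", "product", "checkout"],
    ["home", "product", "checkout"],
    ["search", "product", "checkout"],
    ["home", "search"],
]
two_click = top_paths(sessions, goal="checkout", path_len=2)
one_click = top_paths(sessions, goal="checkout", path_len=1)
```

Asking for one-click or two-click paths is the same call with a different `path_len`, which is the flexibility the slide contrasts with rewriting a multi-pass SQL query.
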
Example: Social Media Analysis

Business Opportunity
•  Identify influencers based not only on # of followers and re-tweets, but also messaging content and sentiment in reply/re-tweets
•  Aggregate individual sentiments by incorporating tweet authors’ influence scores
•  What phrases or product defects do customers often mention before they attrite?

Challenges
•  Ingest and analyze high speed incoming events
•  High quality sentiment output (NLP + Big Data)
•  Insights generated across data sets

Solution
    –  Enhance Hadoop with better interactivity
    –  Integrate NLP packages to Big Data platform
    –  Ingest, analyze, and store all datasets in one platform

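Aggregating sentiment "by incorporating tweet authors' influence scores" amounts to a weighted average. A hedged Python sketch, assuming each tweet already carries a sentiment value in [-1, 1] and a non-negative influence score produced by some upstream model; both fields and the `weighted_sentiment` helper are hypothetical:

```python
def weighted_sentiment(tweets):
    """Influence-weighted mean sentiment.

    `tweets` is an iterable of (sentiment, influence) pairs. Returns
    None when the total influence is zero, since the mean is undefined.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for sentiment, influence in tweets:
        weighted_sum += sentiment * influence
        total_weight += influence
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical (sentiment, influence) pairs for one product.
tweets = [(1.0, 10.0), (-1.0, 30.0), (0.5, 0.0)]
score = weighted_sentiment(tweets)
```

In this sample a single negative tweet from a high-influence author outweighs the positive ones, which an unweighted average would hide.
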
Example: Text Analysis & e-Discovery

Business Goal
•  Archive ALL electronic documents – email, Office, PDF, instant messages, etc. – in a reference archive, retaining original document formats. Provide rapid, flexible access and extraction capabilities for eDiscovery and compliance measures.

Challenges
•  Massive scale of documents in multiple formats and structures.
•  Sophisticated query and analysis requirements.
•  Future formats impossible to predict.
•  Must retain original document format.

Hadapt Value
•  Cost-effective: scale to 100s of TB and PB of original document storage.
•  Flexible query access: use SQL, Full Text Search, or combine SQL+Search.
•  Preventative analysis: apply deduplication, sentiment analysis, categorization to accelerate document assessment.

Building the Archive: Scalability and Cost Issues
Teradata/Netezza - $50K – 100K/TB
Search engine - $100K/TB
Integration costs: $150K
Total: $200K/TB + $150K

Example Analytic Questions
•  Retrieve all emails and instant messages from all employees in Denver office between 1995 and 1998
•  Who are the top 10 recipients of emails from Bob Smith

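The two example questions above map naturally onto SQL. The sketch below uses Python's built-in `sqlite3` module with a made-up `messages` table (columns `sender`, `recipient`, `office`, `kind`, `sent_year`) purely to show the shape of the queries; it does not represent Hadapt's schema or SQL dialect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "sender TEXT, recipient TEXT, office TEXT, kind TEXT, sent_year INTEGER)"
)
# Hypothetical sample rows; `office` is the sender's office.
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
    [
        ("Bob Smith", "Ann Lee", "Denver", "email", 1996),
        ("Bob Smith", "Ann Lee", "Denver", "email", 1997),
        ("Bob Smith", "Raj Patel", "Denver", "email", 1997),
        ("Ann Lee", "Bob Smith", "Boston", "im", 1996),
        ("Raj Patel", "Ann Lee", "Denver", "im", 2001),
    ],
)

# Emails and instant messages from the Denver office, 1995 through 1998.
denver_rows = conn.execute(
    "SELECT sender, recipient, kind, sent_year FROM messages "
    "WHERE office = ? AND kind IN ('email', 'im') "
    "AND sent_year BETWEEN 1995 AND 1998",
    ("Denver",),
).fetchall()

# Top 10 recipients of emails sent by Bob Smith.
top_recipients = conn.execute(
    "SELECT recipient, COUNT(*) AS n FROM messages "
    "WHERE sender = ? AND kind = 'email' "
    "GROUP BY recipient ORDER BY n DESC, recipient LIMIT 10",
    ("Bob Smith",),
).fetchall()
```

A combined SQL+Search query, as the slide mentions, would add a full-text predicate on message bodies; that part is omitted here because the deck does not show its syntax.
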
Hadapt – Key Considerations

Simplicity
•  All-in-one system for “multi-structured” data analytics
•  Single cluster for analysis of multiple data types – low TCO, high performance
•  Analyze relational & unstructured data together to answer new questions
•  Eliminate data movement between Hadoop and RDBMS
•  Use SQL + Full Text Search – a fully integrated solution

Accessibility
•  Leverage existing investment in SQL tools and skills
•  Can roll out Hadapt analytics to existing BI tool users
•  Makes Hadoop easier to adopt for SQL-heavy enterprises

Scalability / Performance
•  Enormous performance boost for multi-structured data analysis
•  Adaptive query planning provides on-the-fly load balancing & fault tolerance
•  Ad-hoc and interactive querying of massive data sets

 


QUESTIONS?
	
  
	
  
             	
  

DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Analyzing Multi-Structured Data

  • 1. Applying Big Data Analytics. Analyzing Multi-Structured Data with Hadoop. Justin Borgman, CEO & Co-Founder
  • 2. Company Profile
    Company
      • 30 people, based in Cambridge, MA
      • Founded in July, 2010
      • Raised $9.5M Series A from Bessemer and Norwest
      • Based on the HadoopDB research project in the Yale Computer Science Department by Daniel Abadi, et al.
    Presenter
      • CEO & Co-Founder
      • Previously spent 7 years as a software developer at MIT Lincoln Laboratory and product manager at startup Covectra
      • Undergrad: UMass Amherst
      • Grad: Yale University
  • 3. Big Data: Volume | Variety | Velocity | VALUE (Source: wikibon.org)
  • 4. Big Data in the Headlines
      • “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”
      • “Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year
      • “Why Netflix produces BBC remake starring Kevin Spacey, directed by David Fincher”
  • 5. The Big Data Ecosystem
  • 6. Example: Big Data Analysis Process
    [Diagram] Raw Data → load → HADOOP → extract → MPP DBMS → predict → Applications, used by the Business Analyst
      • Raw data: web access logs, click logs, impressions, email, tweets, sensor data, documents
      • Processing steps shown: term extraction, entity extraction, sentiment analysis, geocoding, cleanse, sessionization, join, aggregate, sample, filter
      • Applications: BI tools, predictive analytics
  • 7. Example: Hadapt Analysis Process
    [Diagram] Raw Data → load → predict → Applications (BI tools, predictive analytics)
  • 8. The Evolution of Analytics: Where are we today?
    The early stages of analytics
      • Market Basket Analysis
      • Trend Analysis
      • Cyclical Analysis
      • Customer Segmentation
    New Analytical Models
      • Pattern Detection, Discovery, Matching
      • A/B Testing and Behavioral Analysis
      • Sessionization
      • Social Correlation Analysis
      • Fractional Attribution
      • Sentiment Analysis
      • Personalization
  • 9. Big Data in Action
      • Amazon and Netflix engage in arbitrage on video content based on customer behavior
      • Harvard predicts the spread of cholera in Haiti, and Derwent Capital out-trades the market based on tweets and their sentiment
      • Entire ecosystems were shotgun gene sequenced by Celera
      • Life events are predicted by Target and marketed accordingly
      • *Osco Drug increased sales by optimizing product placement, e.g. beer and diapers
      • Ads are optimally placed and priced *for you* by DataXu in real time
      • Next Big Sound predicts new artists and hits based on signals from social media
      • Real-time production optimization saves Chevron over $1B/year
      • Retailer web sites are re-organized and re-optimized for content by Bloomreach
      • LinkedIn suggests who you might know, eHarmony suggests who you might love
  • 10. Example: POS Data Insights
  • 11. Example: e-Tailer
    Business Opportunity
      • Should I run a promotion among the Lady Gaga fans or Justin Bieber fans?
      • Based on shopping cart and browsing/purchase history, what other products should be recommended before the customer checks out?
      • Which items are often purchased together, and is there any correlation with shopping date/time, customer age, gender, etc.?
    Challenges
      • Diverse data sources
      • In-depth analytics (e.g. predictive modeling)
      • Real time performance at scale
    Solution
      • Integrate Hadoop with RDBMS
      • Develop and integrate analytic libraries
      • Make analytic jobs interactive (not batch oriented)
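The "items often purchased together" question on this slide is a co-occurrence count over shopping baskets. The following is a minimal single-machine sketch of that idea, not the Hadoop/RDBMS pipeline the slide proposes; the function name `co_purchase_counts` and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives every pair
        # a single canonical ordering, e.g. ("beer", "diapers").
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample data for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["chips", "salsa"],
]
pair_counts = co_purchase_counts(baskets)
top_pairs = pair_counts.most_common(1)
```

Correlating pairs with shopping time or customer attributes would add those fields to the grouping key; at scale the same counting is typically expressed as a distributed group-by.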
  • 12. Example: Customer Behavior Analysis
    Business Opportunity
      • Analyze customer behavior to increase loyalty and trust, allocate advertising spend, optimize product incentives, identify fraud, micro-segment customer base.
    Challenges
      • Full website session-level data needed, typically from raw web logs
      • Requires complex multi-pass SQL queries or new Non-SQL techniques
      • Requires rewriting query to change number of clicks analyzed
    Hadapt Value
      • Performance: Single pass over data regardless of number of clicks analyzed
      • Ease of Dev & Ease of Manageability: Much simpler code
      • Ease of Use: Pattern flexibility to handle varied numbers of clicks and click patterns without requiring any code rewrite
    Golden Path Analysis: Comparative Performance
      • ETL + RDBMS & SQL = 200 minutes
      • Hadoop + RDBMS = 135 minutes
      • Hadapt = 11 minutes
    Example Analytic Questions
      • Which life events are strong opportunities for me to better engage my customers?
      • When am I about to lose a customer?
      • What are my top segments?
      • Which ad campaigns produced the most lift?
      • What products can I bundle to increase sales?
      • Are my online offers cannibalizing my in-store sales?
      • What models are my customers following so I can better predict their next move?
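Session-level data from raw web logs, as this slide requires, is usually built by sessionization: splitting each user's clicks wherever the gap between consecutive clicks exceeds a timeout. The sketch below is an illustration under stated assumptions, not Hadapt's implementation: the 30-minute timeout, the `(user_id, timestamp_seconds, page)` record layout, and the name `sessionize` are all hypothetical. After the sort, a single scan handles click paths of any length.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def sessionize(clicks, gap=SESSION_GAP_SECONDS):
    """Split (user_id, timestamp_seconds, page) records into sessions.

    Returns a list of (user_id, [pages in click order]) tuples.
    """
    sessions = []
    ordered = sorted(clicks, key=itemgetter(0, 1))  # by user, then time
    for user, user_clicks in groupby(ordered, key=itemgetter(0)):
        current, last_ts = [], None
        for _, ts, page in user_clicks:
            # A long pause closes the current session and starts a new one.
            if last_ts is not None and ts - last_ts > gap:
                sessions.append((user, current))
                current = []
            current.append(page)
            last_ts = ts
        sessions.append((user, current))
    return sessions


# Hypothetical click log for illustration only.
clicks = [
    ("u1", 0, "home"),
    ("u1", 60, "search"),
    ("u1", 4000, "home"),
    ("u2", 10, "home"),
    ("u1", 4100, "cart"),
]
sessions = sessionize(clicks)
```

Each resulting page list is one click path; counting identical paths across sessions is one simple way to look for a "golden path".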
  • 13. Example: Social Media Analysis
    Business Opportunity
      • Identify influencers based not only on # of followers and re-tweets, but also messaging content and sentiment in reply/re-tweets
      • Aggregate individual sentiments by incorporating tweet authors’ influence scores
      • What phrases or product defects do customers often mention before they attrite?
    Challenges
      • Ingest and analyze high speed incoming events
      • High quality sentiment output (NLP + Big Data)
      • Insights generated across data sets
    Solution
      • Enhance Hadoop with better interactivity
      • Integrate NLP packages to Big Data platform
      • Ingest, analyze, and store all datasets in one platform
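One simple reading of "aggregate individual sentiments by incorporating tweet authors' influence scores" is an influence-weighted average. The slides do not specify a formula, so the sketch below is an assumption: sentiment is taken to be a number in [-1, 1], influence a non-negative weight, and `weighted_sentiment` is a hypothetical name.

```python
def weighted_sentiment(tweets):
    """Influence-weighted mean of (sentiment, influence) pairs.

    `tweets` must be a list (it is traversed twice). Returns 0.0 when
    the total influence is zero so the caller never divides by zero.
    """
    total_weight = sum(influence for _, influence in tweets)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in tweets) / total_weight


# Hypothetical data: one positive tweet from an influential author,
# one negative tweet from a less influential author.
tweets = [(1.0, 3.0), (-1.0, 1.0)]
score = weighted_sentiment(tweets)
```

With these inputs the unweighted mean would be 0.0, while the weighted score leans positive because the positive author carries three times the influence.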
  • 14. Example: Text Analysis & e-Discovery
    Business Goal
      • Archive ALL electronic documents (email, Office, PDF, instant messages, etc.) in a reference archive, retaining original document formats. Provide rapid, flexible access and extraction capabilities for eDiscovery and compliance measures.
    Challenges
      • Massive scale of documents in multiple formats and structures.
      • Sophisticated query and analysis requirements.
      • Future formats impossible to predict.
      • Must retain original document format.
    Hadapt Value
      • Cost-effective: scale to 100s of TB and PB of original document storage.
      • Flexible query access: use SQL, Full Text Search, or combine SQL+Search.
      • Preventative analysis: apply deduplication, sentiment analysis, categorization to accelerate document assessment.
    Building the Archive: Scalability and Cost Issues
      • Teradata/Netezza: $50K to 100K/TB
      • Search engine: $100K/TB
      • Integration costs: $150K
      • Total: $200K/TB + $150K
    Example Analytic Questions
      • Retrieve all emails and instant messages from all employees in Denver office between 1995 and 1998
      • Who are the top 10 recipients of emails from Bob Smith
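The cost figures on this slide combine into a linear model: a per-terabyte charge plus a one-time integration cost. The slide's total of $200K/TB matches the upper end of the warehouse range ($100K/TB) plus the search engine ($100K/TB). The sketch below only restates that arithmetic; the function name and the 100 TB example size are hypothetical.

```python
def legacy_archive_cost(terabytes, per_tb=200_000, integration=150_000):
    """Total cost in dollars: per-TB charge plus one-time integration."""
    return terabytes * per_tb + integration


# Per-TB figures from the slide: warehouse range plus search engine.
low_per_tb = 50_000 + 100_000    # lower end of the Teradata/Netezza range
high_per_tb = 100_000 + 100_000  # upper end, the $200K/TB the slide totals

# Hypothetical archive size for illustration only.
cost_100tb = legacy_archive_cost(100)
cost_100tb_low = legacy_archive_cost(100, per_tb=low_per_tb)
```

At 100 TB the per-terabyte term dominates and the $150K integration cost is under 1% of the total, which is why the slide frames scalability as the cost problem.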
  • 15. Hadapt: Key Considerations
    Simplicity
      • All-in-one system for “multi-structured” data analytics
      • Single cluster for analysis of multiple data types: low TCO, high performance
      • Analyze relational & unstructured data together to answer new questions
      • Eliminate data movement between Hadoop and RDBMS
      • Use SQL + Full Text Search: a fully integrated solution
    Accessibility
      • Leverage existing investment in SQL tools and skills
      • Can roll out Hadapt analytics to existing BI tool users
      • Makes Hadoop easier to adopt for SQL-heavy enterprises
    Scalability / Performance
      • Enormous performance boost for multi-structured data analysis
      • Adaptive query planning provides on-the-fly load balancing & fault tolerance
      • Ad-hoc and interactive querying of massive data sets