Big Data beyond Apache Hadoop - How to integrate ALL your Data
 
Big data represents a significant paradigm shift in enterprise technology. Big data radically changes the nature of the data management profession as it introduces new concerns about the volume, velocity and variety of corporate data.
Apache Hadoop is the open source de facto standard for implementing big data solutions on the Java platform. Hadoop consists of its kernel, MapReduce, and the Hadoop Distributed Filesystem (HDFS). Sending all data to Hadoop for processing and storage (and getting it back to your application later) is a challenging task, because in practice data comes from many different applications (SAP, Salesforce, Siebel, etc.) and data stores (files, SQL, NoSQL), uses different technologies and concepts for communication (e.g. HTTP, FTP, RMI, JMS), and arrives in different data formats such as CSV, XML, or binary data.
This session shows how the combination of Apache Hadoop and Apache Camel solves this task. Learn how to use every conceivable kind of data with Hadoop without writing lots of complex or redundant boilerplate code. Besides supporting the integration of all these technologies and data formats, Apache Camel also offers an easy, standardized DSL to transform, split or filter incoming data using the Enterprise Integration Patterns (EIP). Therefore, Apache Hadoop and Apache Camel are a perfect match for processing big data on the Java platform.
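
To make the "transform, split or filter" idea concrete, here is a minimal sketch in plain Java. It does not use Apache Camel or its DSL; the class name `EipSketch`, the method names, and the sample record layout (comma-separated values, `#` for comment lines) are assumptions made only for illustration of the three patterns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: plain-Java stand-ins for three Enterprise Integration Patterns.
// This is NOT Camel code; all names here are hypothetical.
public class EipSketch {

    // Splitter: break one incoming batch message into individual records.
    static List<String> split(String batch) {
        return Arrays.asList(batch.split("\n"));
    }

    // Message Filter: drop blank records and comment lines (assumed to start with '#').
    static boolean accept(String record) {
        return !record.trim().isEmpty() && !record.startsWith("#");
    }

    // Message Translator: convert a comma-separated record into a pipe-separated one.
    static String transform(String record) {
        return String.join("|", record.trim().split("\\s*,\\s*"));
    }

    // Chain the three steps, roughly what an integration route would express declaratively.
    static List<String> process(String batch) {
        return split(batch).stream()
                .filter(EipSketch::accept)
                .map(EipSketch::transform)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String batch = "# header\nalice, 42\n\nbob,7";
        System.out.println(process(batch)); // prints [alice|42, bob|7]
    }
}
```

An integration framework expresses the same chain declaratively and adds the connectivity (JMS, FTP, HDFS and so on) that this sketch leaves out.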

Usage Rights

© All Rights Reserved

  • Comment: Can you tell me which version of Hadoop you used with Camel?

    Presentation Transcript

    • Big data beyond Hadoop – How to integrate ALL your data. Kai Wähner, kwaehner@talend.com, @KaiWaehner, www.kai-waehner.de, 4/26/13
    • © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner. Kai Wähner: Consulting, Developing, Coaching, Speaking, Writing. Main Tasks: Requirements Engineering; Enterprise Architecture Management; Business Process Management; Architecture and Development of Applications; Service-oriented Architecture; Integration of Legacy Applications; Cloud Computing; Big Data. Contact: Email: kontakt@kai-waehner.de; Blog: www.kai-waehner.de/blog; Twitter: @KaiWaehner; Social Networks: Xing, LinkedIn
    • Key messages: You have to care about big data to be competitive in the future! You have to integrate different sources to get the most value out of it! Big data integration is no (longer) rocket science!
    • Agenda: Big data paradigm shift; Challenges of big data; Big data from a technology perspective; Integration with an open source framework; Integration with an open source suite
    • Why should you care about big data? “If you can't measure it, you can't manage it.” William Edwards Deming (1900–1993), American statistician, professor, author, lecturer and consultant
    • Why should you care about big data? → “Silence the HiPPOs” (highest-paid person's opinion). → When you can interpret unimaginably large data streams, gut feeling is no longer justified!
    • What is big data? The Vs of big data: Volume (terabytes, petabytes); Variety (social networks, blog posts, logs, sensors, etc.); Velocity (realtime or near-realtime); Value
    • Big data tasks to solve before analysis. Big Data Integration: land data in a big data cluster; implement or generate parallel processes. Big Data Manipulation: simplify manipulation, such as sort and filter; computationally expensive functions. Big Data Quality & Governance: identify linkages and duplicates, validate big data; match component, execute basic quality features. Big Data Project Management: place frameworks around big data projects; common repository, scheduling, monitoring
    • Use case: Replacing ETL jobs. “The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want: ➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins. ➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.” http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
    • Use case: Risk management. Deduce customer defections. http://hkotadia.com/archives/5021
    • Use case: Flexible pricing. ➜ With revenue of almost USD 30 billion and a network of 800 locations, Macy's is considered the largest store operator in the USA. ➜ Daily price check analysis of its 10,000 articles in less than two hours. ➜ Whenever a neighboring competitor anywhere between New York and Los Angeles goes for aggressive price reductions, Macy's follows its example. ➜ If there is no market competitor, the prices remain unchanged. http://www.t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/1029702
    • Agenda: Big data paradigm shift; Challenges of big data; Big data from a technology perspective; Integration with an open source framework; Integration with an open source suite
    • Limited big data experts. This is your company. Big Data Geek.
    • Big data tool selection (business perspective). ➜ Wanna buy a big data solution for your industry? ➜ Maybe a competitor has a big data solution which adds business value? ➜ The competitor will never publish it (rat-race)!
    • Big data tool selection (technical perspective). Looking for ‘your’ required big data product? Support your data from scratch? Good luck! :-)
    • How to solve these big data challenges?
    • Be no expert! Be simple! → “[Often] simple models and big data trump more-elaborate [and complex] analytics approaches” → “Often someone coming from outside an industry can spot a better way to use big data than an insider” Erik Brynjolfsson / Lynn Wu. http://alfredopassos.tumblr.com/post/32461599327/big-data-the-management-revolution-by-andrew-mcafee
    • Be creative! → Look at use cases of others (SMU, but also large companies). → How can you do something similar with your data? → You have different data sources? Use it! Combine it! Play with it!
    • What is your Big Data process? Step 1: Do not begin with the data, think about business opportunities. Step 2: Choose the right data (combine different data sources). Step 3: Use easy tooling. http://hbr.org/2012/10/making-advanced-analytics-work-for-you
    • Agenda: Big data paradigm shift; Challenges of big data; Big data from a technology perspective; Integration with an open source framework; Integration with an open source suite
    • Technology perspective: How to process big data?
    • How to process big data? The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes. This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes. Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is always the slowest part of the process, becomes the weakest link in the performance chain. http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead
    • How to process big data? Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java Video: http://www.infoq.com/presentations/Big-Data-Hadoop-Java
    • How to process big data? The de facto standard for big data processing
    • How to process big data? Even Microsoft (the .NET house) has relied on Hadoop since 2011: “A big part of [the company’s strategy] includes wiring SQL Server 2012 (formerly known by the codename “Denali”) to the Hadoop distributed computing platform, and bringing Hadoop to Windows Server and Azure”
    • What is Hadoop? Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
    • Map (Shuffle) Reduce. Simple example: • Input: (very large) text files with lists of strings, such as: „318, 0043012650999991949032412004...0500001N9+01111+99999999999...“ • We are interested in just some of the content: year and temperature (marked in red on the slide) • The MapReduce function has to compute the maximum temperature for every year. Example from the book “Hadoop: The Definitive Guide, 3rd Edition”
    • How to process big data?
    • Hadoop alternatives (diagram, features included from few to many): Apache Hadoop (MapReduce, HDFS, ecosystem); Hadoop distribution (+ packaging, deployment tooling, support); Big data suite (+ tooling / modeling, code generation, scheduling, integration)
    • Agenda: Big data paradigm shift; Challenges of big data; Big data from a technology perspective; Integration with an open source framework; Integration with an open source suite
    • Alternatives for systems integration (diagram, complexity of integration from low to high): Integration framework (connectivity, routing, transformation); Enterprise service bus (+ tooling, monitoring, support); Integration suite (+ business process management, big data / MDM, registry / repository, rules engine, „you name it“)
    • Alternatives for systems integration (complexity of integration from low to high): integration framework; enterprise service bus; integration suite
    • More details about integration frameworks... http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-framework-spring-integration-apache-camel-vs-enterprise-service-bus-esb/
    • Enterprise Integration Patterns (EIP). Apache Camel implements the EIPs.
    • Enterprise Integration Patterns (EIP)
    • Architecture. http://java.dzone.com/articles/apache-camel-integration
    • Choose your required components: HTTP, FTP, File, XSLT, MQ, JDBC, Akka, TCP, SMTP, RSS, Quartz, Log, LDAP, JMS, EJB, AMQP, Atom, AWS-S3, Bean-Validation, CXF, IRC, Jetty, JMX, Lucene, Netty, RMI, SQL, many many more, custom components
    • Choose your favorite DSL. XML. (not production-ready yet)
    • Deploy it wherever you need: Standalone; OSGi; Application Server; Web Container; Spring Container; Cloud
    • Enterprise-ready: Open Source; Scalability; Error Handling; Transaction; Monitoring; Tooling; Commercial Support
    • Example: Camel integration route
    • Example: camel-hdfs component
      // Producer: route messages from a JMS queue into a file in HDFS
      from("jms:MyQueue").to("hdfs:///myDirectory/myFile.txt?valueType=TEXT");
      // Consumer: read the file from HDFS and pass it to the file endpoint
      from("hdfs:///myDirectory/myFile.txt").to("file:target/reports/report.txt");
    • Live demo: Apache Camel in action...
    • Agenda: Big data paradigm shift; Challenges of big data; Big data from a technology perspective; Integration with an open source framework; Integration with an open source suite
    • Alternatives for systems integration (diagram, complexity of integration from low to high): Integration framework (connectivity, routing, transformation); Enterprise service bus (+ tooling, monitoring, support); Integration suite (+ business process management, big data / MDM, registry / repository, rules engine, „you name it“)
    • Alternatives for systems integration (complexity of integration from low to high): integration framework; enterprise service bus; integration suite
    • More details about ESBs and suites... http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choice-how-to-choose-the-right-enterprise-service-bus-esb/
    • Vision: Democratize big data ...an open source ecosystem. Talend Open Studio for Big Data: • Improves efficiency of big data job design with a graphical interface • Generates Hadoop code and runs transforms inside Hadoop • Native support for HDFS, Pig, HBase, HCatalog, Sqoop and Hive • 100% open source under an Apache License • Standards based
    • Vision: Democratize big data ...an open source ecosystem. Talend Platform for Big Data: • Builds on Talend Open Studio for Big Data • Adds data quality, advanced scalability and management functions • MapReduce massively parallel data processing • Shared repository and remote deployment • Data quality and profiling • Data cleansing • Reporting and dashboards • Commercial support, warranty/IP indemnity under a subscription license
    • Talend Open Studio for Big Data
    • Live demo: „Talend Open Studio for Big Data“ in action...
    • Did you get the key message?
    • Key messages: You have to care about big data to be competitive in the future! You have to integrate different sources to get the most value out of it! Big data integration is no (longer) rocket science!
    • Did you get the key message?
    • Thank you for your attention. Questions? kwaehner@talend.com, www.kai-waehner.de, LinkedIn / Xing, @KaiWaehner
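
The "Map (Shuffle) Reduce" slide in the transcript describes computing the maximum temperature per year. The three phases can be sketched in plain Java, without the Hadoop API and running in a single process. Because the sample record on the slide is elided, this sketch assumes a simplified, hypothetical record format of `year,temperature`; the class and method names are likewise assumptions and do not come from the slides or from the book.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of map, shuffle and reduce.
// Real Hadoop jobs distribute these phases across a cluster.
public class MaxTemperatureSketch {

    // Map: turn one simplified "year,temperature" record into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return new SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Shuffle: group all emitted temperatures by their year key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse the temperatures of one year into their maximum.
    static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Run the three phases in sequence over a list of records.
    static Map<String, Integer> maxTemperatureByYear(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the slides.
        List<String> records = Arrays.asList("1949,111", "1950,0", "1950,22", "1949,78", "1950,-11");
        System.out.println(maxTemperatureByYear(records)); // prints {1949=111, 1950=22}
    }
}
```

In Hadoop the map and reduce steps run on the nodes that hold the data and the framework performs the shuffle, which is the data-locality advantage the transcript contrasts with parallel ETL tools.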