Data Integration in 2013:
  A working session
  Adam Muise
  March 26 2013




Note: This deck is purposely sparse. Want value?
Join the conversation in the Toronto Hadoop User
Group:
http://www.meetup.com/TorontoHUG/

  © Hortonworks Inc. 2012
Proposed Agenda
•   Introductions
•   Discuss common Data Integration Patterns
•   Round-table of User Group Member CDC/ETL Use Cases
•   New Data Integration Solutions: A change from the Old Guard:
    –   Hadoop and the Data Lake
    –   Streaming (+ Hadoop)
    –   Data Lake Governance / Management (InfoTrellis)
    –   Databus (LinkedIn)




Introductions
Who let you in?




General Data Integration Patterns
• Enterprise Application Integration*
       – Metadata lookup
       – Validation
       – Extra-app communication

• Enterprise Service Bus (SOA, Message Bus/Hub)*
• Federation*
       – Bridging multiple databases with a query layer
       – Eg: Composite

• Extract Transform Load (ETL)*
       – Collection
       – Aggregation
       – Format/Schema transformation

• Data Lake
       – Landing Zone for multiple datasets in one store
       – Mixed schema, often raw structured/unstructured data
       – Eg: Hadoop

* Source: Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony David Giordano, 2010, IBM Press.
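
As a rough illustration of how the ETL and Data Lake patterns above differ, the sketch below first lands raw records untouched (the landing-zone idea) and then runs the three ETL steps over them: collection, format/schema transformation, and aggregation. The records, field names, and function names are all made up for this example; nothing here is tied to a specific tool.

```python
import json
from collections import defaultdict
from decimal import Decimal

# Made-up source records; the field names are illustrative only.
raw_lines = [
    '{"account": "A1", "amount": "10.50", "ts": "2013-03-26T09:00:00"}',
    '{"account": "A2", "amount": "4.00", "ts": "2013-03-26T09:05:00"}',
    '{"account": "A1", "amount": "2.25", "ts": "2013-03-26T10:00:00"}',
]


def land_raw(lines):
    """Data lake style: keep every record exactly as it arrived."""
    return list(lines)


def transform(line):
    """ETL format/schema transformation: parse and normalise one record."""
    rec = json.loads(line)
    return {
        "account": rec["account"],
        "day": rec["ts"][:10],                       # keep only the date part
        "cents": int(Decimal(rec["amount"]) * 100),  # exact decimal arithmetic
    }


def aggregate(records):
    """ETL aggregation: total amount per account per day."""
    totals = defaultdict(int)
    for rec in records:
        totals[(rec["account"], rec["day"])] += rec["cents"]
    return dict(totals)


landing_zone = land_raw(raw_lines)  # collection: raw data stays available
daily_totals = aggregate(transform(line) for line in landing_zone)
print(daily_totals)
```

Because the raw lines stay in the landing zone, a different transformation can be applied later without going back to the source system.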


Use Case Roundtable
Data that’s keeping you up at night…




Scotia iTrade: Geoffrey Li




New Data Integration Solutions

Fresh ideas for new and old problems…




Hadoop: The Data Lake


[Diagram: data lake workflow on Hadoop. Stage labels: Extract & Load; Model / Apply Metadata; Transform & Aggregate; Publish Event / Signal Data Transformation; Publish / Exchange; Explore / Visualize / Report; Analyze]
Streaming & Hadoop




http://developer.yahoo.com/blogs/ydn/posts/2013/02/storm-and-hadoop-convergence-of-big-data-and-low-latency-processing/

Databus (LinkedIn)
Databus is a low-latency change capture system that has become an
integral part of LinkedIn’s data processing pipeline. Databus addresses a
fundamental requirement to reliably capture, flow, and process primary
data changes. Databus provides the following features:
   1.    Isolation between sources and consumers
   2.    Guaranteed in-order, at-least-once delivery with high availability
   3.    Consumption from an arbitrary point in time in the change stream, including full bootstrap
         capability for the entire data set
   4.    Partitioned consumption
   5.    Source consistency preservation
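
To make the delivery guarantees above more concrete, here is a minimal in-memory sketch of a change stream with in-order, at-least-once delivery, consumption from an arbitrary point, and partitioned consumption. It is not the Databus client API: `ChangeEvent`, `ChangeLog`, `Consumer`, and `partition_of` are hypothetical names invented for this example, and bootstrap and high availability are not modelled.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ChangeEvent:
    scn: int      # sequence number assigned in commit order (illustrative name)
    key: str
    value: dict


class ChangeLog:
    """In-memory, append-only change stream; a stand-in for a real relay."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def append(self, key: str, value: dict) -> ChangeEvent:
        event = ChangeEvent(scn=len(self._events) + 1, key=key, value=value)
        self._events.append(event)
        return event

    def read_after(self, scn: int) -> List[ChangeEvent]:
        # SCNs start at 1 and are dense, so slicing keeps commit order.
        return self._events[scn:]


def partition_of(key: str, num_partitions: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Consumer:
    """Reads one partition of the stream and tracks its own checkpoint."""

    def __init__(self, log: ChangeLog, partition: int, num_partitions: int,
                 handler: Callable[[ChangeEvent], None]) -> None:
        self.log = log
        self.partition = partition
        self.num_partitions = num_partitions
        self.handler = handler
        self.checkpoint = 0  # SCN of the last event fully processed

    def poll(self) -> None:
        for event in self.log.read_after(self.checkpoint):
            if partition_of(event.key, self.num_partitions) == self.partition:
                # If the handler raises, the checkpoint is not advanced, so the
                # same event is delivered again on the next poll (at-least-once).
                self.handler(event)
            self.checkpoint = event.scn


log = ChangeLog()
for i in range(5):
    log.append(f"member-{i}", {"version": i})

seen: List[int] = []
consumer = Consumer(log, partition=0, num_partitions=1,
                    handler=lambda e: seen.append(e.scn))
consumer.poll()          # delivers SCNs 1..5 in order
consumer.checkpoint = 2  # rewind to an earlier point in the stream
consumer.poll()          # redelivers SCNs 3..5
print(seen)
```

Since an event can be delivered more than once (after a handler failure or a rewind), handlers in this model need to be idempotent. The source only writes to the log and never sees the consumers, which is the isolation property in miniature.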




  https://github.com/linkedin/databus/wiki

