2013 march 26_thug_etl_cdc_talking_points
Some diagrams for our roundtable on modern ETL/CDC with Hadoop and other new technologies

    Presentation Transcript

    • Data Integration in 2013: A working session
      Adam Muise, March 26 2013
      Note: This deck is purposely sparse. Want value? Join the conversation in the Toronto Hadoop User Group: http://www.meetup.com/TorontoHUG/
      © Hortonworks Inc. 2012
    • Proposed Agenda
      • Introductions
      • Discuss common Data Integration Patterns
      • Round-table of User Group Member CDC/ETL Use Cases
      • New Data Integration Solutions: A change from the Old Guard:
        – Hadoop and the Data Lake
        – Streaming (+ Hadoop)
        – Data Lake Governance / Management (InfoTrellis)
        – Databus (LinkedIn)
    • Introductions
      Who let you in?
    • General Data Integration Patterns
      • Enterprise Application Integration*
        – Metadata lookup
        – Validation
        – Extra-app communication
      • Enterprise Service Bus (SOA, Message Bus/Hub)*
      • Federation*
        – Bridging multiple databases with a query layer
        – E.g.: Composite
      • Extract Transform Load (ETL)*
        – Collection
        – Aggregation
        – Format/Schema transformation
      • Data Lake
        – Landing Zone for multiple datasets in one store
        – Mixed schema, often raw structured/unstructured data
        – E.g.: Hadoop
      * Source: Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony David Giordano, 2010, IBM Press.
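The three ETL sub-steps listed on this slide (collection, format/schema transformation, aggregation) can be sketched in a few lines. This is only an illustration under stated assumptions: the two in-memory sources, their field names, and the target schema are all hypothetical and do not come from the deck.

```python
from collections import defaultdict

# Hypothetical raw rows from two sources with different schemas.
CRM_ROWS = [
    {"cust_id": "17", "amount_cad": "120.50"},
    {"cust_id": "42", "amount_cad": "80.00"},
]
WEB_ROWS = [
    {"customer": 17, "total_cents": 2500},
]


def extract():
    """Collection: pull rows from every source, tagged with their origin."""
    for row in CRM_ROWS:
        yield "crm", row
    for row in WEB_ROWS:
        yield "web", row


def transform(source, row):
    """Format/schema transformation: map each source onto one target schema."""
    if source == "crm":
        return {
            "customer_id": int(row["cust_id"]),
            "amount_cents": round(float(row["amount_cad"]) * 100),
        }
    if source == "web":
        return {
            "customer_id": int(row["customer"]),
            "amount_cents": int(row["total_cents"]),
        }
    raise ValueError(f"unknown source: {source}")


def aggregate(records):
    """Aggregation: total spend per customer across all sources."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["customer_id"]] += rec["amount_cents"]
    return dict(totals)


def run_etl():
    """Extract, transform, then aggregate; the returned dict stands in for the load target."""
    return aggregate(transform(source, row) for source, row in extract())
```

In a data lake setup the raw rows would typically be landed unchanged first, and a transformation like `transform` would be applied later when the data is read.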
    • Use Case Roundtable
      Data that’s keeping you up at night…
    • Scotia iTrade: Geoffrey Li
    • New Data Integration Solutions
      Fresh ideas for new and old problems…
    • Hadoop: The Data Lake
      [Diagram. Labels recovered from the slide, grouping approximate: Publish Event; Signal Data; Transformation; Model / Apply Metadata; Transform & Aggregate; Publish; Exchange; Explore; Visualize & Report; Extract & Load; Analyze]
    • Streaming & Hadoop
      http://developer.yahoo.com/blogs/ydn/posts/2013/02/storm-and-hadoop-convergence-of-big-data-and-low-latency-processing/
    • DataBus (LinkedIn)
      Databus is a low latency change capture system which has become an integral part of LinkedIn’s data processing pipeline. Databus addresses a fundamental requirement to reliably capture, flow and process primary data changes. Databus provides the following features:
      1. Isolation between sources and consumers
      2. Guaranteed in-order and at-least-once delivery with high availability
      3. Consumption from an arbitrary time point in the change stream, including full bootstrap capability of the entire data
      4. Partitioned consumption
      5. Source consistency preservation
      https://github.com/linkedin/databus/wiki
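To make the delivery guarantees above concrete, here is a minimal in-memory sketch of an ordered change stream with checkpointed consumers. It is not the Databus API: `ChangeLog`, `Consumer`, and the integer sequence numbers are hypothetical names chosen for illustration. The sketch shows how in-order, at-least-once delivery can be made safe with an idempotent apply step, and how a consumer can start from an arbitrary point or bootstrap from the beginning.

```python
class ChangeLog:
    """Append-only, ordered change stream for one source (hypothetical)."""

    def __init__(self):
        self._events = []  # the event at list index i has sequence number i + 1

    def append(self, key, value):
        """Record a change and return its sequence number."""
        seq = len(self._events) + 1
        self._events.append((seq, key, value))
        return seq

    def read_from(self, after_seq):
        """Return every event with a sequence number above after_seq, in order."""
        return self._events[after_seq:]


class Consumer:
    """Keeps a checkpoint so that redelivered events are applied only once."""

    def __init__(self, start_after=0):
        self.state = {}
        self.checkpoint = start_after  # 0 means a full bootstrap

    def apply(self, event):
        seq, key, value = event
        if seq <= self.checkpoint:
            return  # duplicate from at-least-once redelivery; skip it
        self.state[key] = value
        self.checkpoint = seq

    def poll(self, log):
        """Pull and apply everything after the current checkpoint."""
        for event in log.read_from(self.checkpoint):
            self.apply(event)
```

The consumer only ever talks to the log, never to the source database, which is one simple way to get the isolation between sources and consumers that the slide lists first. Partitioned consumption could be layered on by keeping one such log per partition key range.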