2013 march 26_thug_etl_cdc_talking_points
Some diagrams for our roundtable on modern ETL/CDC with Hadoop and other new technologies


  • 1. Data Integration in 2013: A working session. Adam Muise, March 26, 2013. Note: This deck is purposely sparse. Want value? Join the conversation in the Toronto Hadoop User Group. © Hortonworks Inc. 2012
  • 2. Proposed Agenda
      • Introductions
      • Discuss common Data Integration Patterns
      • Round-table of User Group Member CDC/ETL Use Cases
      • New Data Integration Solutions: A change from the Old Guard:
          – Hadoop and the Data Lake
          – Streaming (+ Hadoop)
          – Data Lake Governance / Management (InfoTrellis)
          – Databus (LinkedIn)
  • 3. Introductions: Who let you in?
  • 4. General Data Integration Patterns
      • Enterprise Application Integration*
          – Metadata lookup
          – Validation
          – Extra-app communication
      • Enterprise Service Bus (SOA, Message Bus/Hub)*
      • Federation*
          – Bridging multiple databases with a query layer
          – E.g., Composite
      • Extract Transform Load (ETL)*
          – Collection
          – Aggregation
          – Format/Schema transformation
      • Data Lake
          – Landing zone for multiple datasets in one store
          – Mixed schema, often raw structured/unstructured data
          – E.g., Hadoop
      * Source: Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony David Giordano, 2010, IBM Press.
  • 5. Use Case Roundtable: Data that’s keeping you up at night…
  • 6. Scotia iTrade: Geoffrey Li
  • 7. New Data Integration Solutions: Fresh ideas to new and old problems…
  • 8. Hadoop: The Data Lake [Diagram; labels as extracted, in reading order: Publish Event Signal Data Transformation Model/ Transform & Apply Metadata Aggregate Publish Exchange Explore Visualize Extract & Report Load Analyze]
  • 9. Streaming & Hadoop
  • 10. Streaming & Hadoop
  • 11. DataBus (LinkedIn)
      Databus is a low-latency change capture system which has become an integral part of LinkedIn’s data processing pipeline. Databus addresses a fundamental requirement to reliably capture, flow, and process primary data changes. Databus provides the following features:
      1. Isolation between sources and consumers
      2. Guaranteed in-order, at-least-once delivery with high availability
      3. Consumption from an arbitrary time point in the change stream, including full bootstrap capability of the entire data
      4. Partitioned consumption
      5. Source consistency preservation
  • 12. DataBus (LinkedIn)
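
To make the Databus feature list on slide 11 a little more concrete, here is a minimal in-memory sketch of change capture with in-order, at-least-once delivery and consumption from an arbitrary point in the stream. It is not the Databus API: the names ChangeLog, Consumer, Change, scn, read_since, and poll are all hypothetical, chosen only for illustration. The sketch assumes that each change carries a monotonically increasing sequence number and that each consumer tracks its own checkpoint.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Change:
    scn: int    # hypothetical sequence number, increasing by one per change
    key: str
    value: Any


class ChangeLog:
    """In-memory stand-in for a change stream (illustrative, not Databus)."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def append(self, key: str, value: Any) -> Change:
        # The source only appends; it knows nothing about consumers,
        # which is the "isolation" idea in its simplest form.
        change = Change(scn=len(self._changes) + 1, key=key, value=value)
        self._changes.append(change)
        return change

    def read_since(self, scn: int) -> List[Change]:
        # Return, in order, every change with a sequence number above scn.
        return [c for c in self._changes if c.scn > scn]


@dataclass
class Consumer:
    state: Dict[str, Any] = field(default_factory=dict)
    checkpoint: int = 0    # last sequence number fully applied

    def apply(self, change: Change) -> None:
        # At-least-once delivery means a change may arrive twice, so
        # anything at or below the checkpoint is ignored.
        if change.scn <= self.checkpoint:
            return
        self.state[change.key] = change.value
        self.checkpoint = change.scn

    def poll(self, log: ChangeLog) -> int:
        # Apply everything after the checkpoint; return how many were read.
        changes = log.read_since(self.checkpoint)
        for change in changes:
            self.apply(change)
        return len(changes)


log = ChangeLog()
log.append("user:1", {"name": "Ann"})
log.append("user:2", {"name": "Bob"})
log.append("user:1", {"name": "Ann B."})

bootstrap = Consumer()           # checkpoint 0: replays the whole stream
bootstrap.poll(log)

resumed = Consumer(checkpoint=2)  # resumes from an arbitrary point
resumed.poll(log)
```

Here the consumer starting at checkpoint 0 rebuilds the full state, while the one starting at checkpoint 2 sees only the last change. A real system would persist both the log and the checkpoints; this sketch keeps everything in memory to stay short.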