Your SlideShare is downloading. ×
0
Kai Wähner| Talend
@KaiWaehner
www.kai-waehner.de/blog
Xing / LinkedIn

Big Data beyond Hadoop
How to integrate ALL your D...
Kai Wähner
Main Tasks
Requirements Engineering
Enterprise Architecture Management
Business Process Management
Architecture...
Key messages

You have to care about big data to be competitive in the future!
You have to integrate different sources to ...
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Why should you care about big data?

“If you can't measure it,
you can't manage it.”
William Edwards Deming
(1900 –1993)
A...
Why should you care about big data?

„Silence the HiPPOs“ (highest-paid person‘s opinion)
 Being able to interpret unimag...
What is big data? The Vs of big data

Volume
(terabytes,
petabytes)

Velocity
(realtime or nearrealtime)

Variety
(social ...
Big data tasks to solve - before analysis
Big Data Integration
– Land data in a Big Data cluster
– Implement or generate p...
Use case: Clickstream Analysis

“The advantage of their new system is that they can now look at their data
[from their log...
Use case: Risk management

Deduce
Customer
Defections

http://hkotadia.com/archives/5021

© Talend 2013

“Big Data beyond ...
Use case: Flexible pricing

➜

➜
➜

➜

With revenue of almost USD 30 billion and a network of
800 locations, Macy's is con...
Storage: Compliance

Global Parcel Service
A lot of data must be stored „forever“
➜ Numbers increase exponentially
➜ Goal:...
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Limited big data experts

This is your
company
Big Data Geek
© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL...
Big data tool selection (business perspective)
Wanna buy a big data solution for your industry?
➜ Maybe a competitor has a...
Big data tool selection (technical perspective)

Looking for ‚your‘ required big data product?
Support your data from scra...
Data quality

Big Data + Poor Data Quality = Big Problems

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL y...
How to solve these big data challenges?

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai W...
What is your Big Data process?

Step 1

Step 2

Step 3

1) Do not begin with the data, think about business opportunities
...
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Technology perspective

How to process big data?
© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” ...
How to process big data?

The critical flaw in parallel ETL tools is the fact that the data is almost never local to the p...
How to process big data?

Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java
Video: http://www.info...
How to process big data?

The defacto standard for big data processing

© Talend 2013

“Big Data beyond Hadoop – How to in...
How to process big data?

“A big part of [the
company’s strategy]
includes wiring SQL Server
2012 (formerly known by
the c...
What is Hadoop?

Apache Hadoop, an open-source software library, is a
framework that allows for the distributed processing...
Map (Shuffle) Reduce
Simple example
• Input: (very large) text files with lists of strings, such as:
„318, 004301265099999...
How to process big data?

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Alternatives for systems integration

Integration Suite

Enterprise
Service Bus

Integration
Framework

Low

Connectivity
...
Alternatives for systems integration

Integration
Framework

Enterprise
Service Bus

Low

© Talend 2013

Integration Suite...
More details about integration frameworks...

http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-frameworkspri...
Enterprise Integration Patterns (EIP)

Apache Camel
Implements the EIPs
© Talend 2013

“Big Data beyond Hadoop – How to in...
Enterprise Integration Patterns (EIP)

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wäh...
Enterprise Integration Patterns (EIP)

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wäh...
Architecture

http://java.dzone.com/articles/apache-camel-integration

© Talend 2013

“Big Data beyond Hadoop – How to int...
Choose your required connectors
TCP

SQL

SMTP

Netty
RMI

FTP

Lucene

JMX

MQ

Atom
LDAP

Quartz

XSLT

Akka
Many many m...
Choose your favorite DSL

XML

(not production-ready yet)

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL y...
Deploy it wherever you need

Standalone

Application Server

Web Container
Spring Container
OSGi
Cloud
© Talend 2013

“Big...
Enterprise-ready

• Open Source
• Scalability
• Error Handling
• Transaction
• Monitoring
• Tooling
• Commercial Support
©...
Example: Camel integration route

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Hadoop Integration with Apache Camel

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähn...
camel-hdfs connector (Java DSL)
// Producer
from("ftp://user@myServer?password=secret")
.to(“hdfs:///myDirectory/myFile.tx...
camel-hbase connector (XML DSL)

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Camel and Hadoop File Formats
Choose the appropriate Hadoop file format, e.g. SequenceFile, MapFile, etc...

© Talend 2013...
camel-avro connector
Camel is not just about connectors ...
Camel supports a pluggable DataFormat to allow messages to be ...
Real World Use Case: Storage: Compliance

Global Parcel Service
A lot of data must be stored „forever“
➜ Numbers increase ...
Real world use case: Storage: Compliance
Orders
(Server 1)

Payments
(Server 2)

ETL

Log Files
(Server 3)

Log Files
(Ser...
Live demo

Apache Camel in action...
© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähne...
camel-pig? camel-hive? camel-hcatalog?
Not available yet (current Camel version: 2.12)  Workarounds:
•

•
•

Use Pig / Hi...
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Alternatives for systems integration

Integration Suite

Enterprise
Service Bus

Integration
Framework

Low

Connectivity
...
Alternatives for systems integration

Integration
Framework

Enterprise
Service Bus

Low

© Talend 2013

Integration Suite...
More details about ESBs and suites...

http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choicehow-to-choose-the-right-...
Hadoop Integration with Talend Open Studio

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Ka...
Vision: Democratize big data
Talend Open Studio for Big Data
•

•

…an open source
ecosystem
© Talend 2013

Generates Hado...
Vision: Democratize big data
Talend Platform for Big Data
•

Builds on Talend Open Studio for Big Data

•

Adds data quali...
Talend Open Studio for Big Data

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Real world Use case: Clickstream Analysis

“The advantage of their new system is that they can now look at their data
[fro...
Example: A semi-structured log file

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähne...
Potential Uses of Clickstream Data
One of the original uses of Hadoop at Yahoo was to store and process their massive volu...
Real world use case: Clickstream Analysis

Log Files
(Server 1)

ETL

Log Files
(Server 2)

Log Files
(Server 100)
© Talen...
Example: ETL Job

„... using Talend’s HDFS and Hive Components”
© Talend 2013

“Big Data beyond Hadoop – How to integrate ...
Example: ETL Job

„... using Talend’s Map Reduce Components*”
* Not available in open source version of Talend Studio
© Ta...
Example: Analysis with Microsoft Excel

We can see that the largest number of page hits in Florida were for
clothing, foll...
Example: Analysis with Microsoft Excel

The chart shows that the majority of men shopping for clothing on our
website are ...
Example: Analysis with Tableau

Spoilt for Choice  Use your preferred BI or Analysis tool!
© Talend 2013

“Big Data beyon...
Live demo

„Talend Open Studio for Big Data“ in action...
© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL yo...
Agenda
• Big data paradigm shift
• Challenges for integrating big data
• Choosing Hadoop for big data integration
• Integr...
Custom connectors

Easy to realize for all
integration alternatives *
•

Integration Framework

•

Enterprise Service Bus
...
Custom connectors

You might need a ...
•

... Hive connector for Camel

•

... Impala connector for Talend

•

... custom...
Live demo (Example: Apache Camel)

Custom connectors in action...
© Talend 2013

“Big Data beyond Hadoop – How to integrat...
Did you get the key message?

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Key messages

You have to care about big data to be competitive in the future!
You have to integrate different sources to ...
Did you get the key message?

© Talend 2013

“Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Thank you for your attention.

Questions?
Kai Wähner| Talend
@KaiWaehner
www.kai-waehner.de/blog
Xing / LinkedIn
Upcoming SlideShare
Loading in...5
×

WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL your Data with Camel and Talend

1,615

Published on

Big data represents a significant paradigm shift in enterprise technology. Big data radically changes the nature of the data management profession as it introduces new concerns about the volume, velocity and variety of corporate data. Apache Hadoop is the open source defacto standard for implementing big data solutions on the Java platform. Hadoop consists of its kernel, MapReduce, and the Hadoop Distributed Filesystem (HDFS). A challenging task is to send all data to Hadoop for processing and storage (and then get it back to your application later), because in practice data comes from many different applications (SAP, Salesforce, Siebel, etc.) and databases (File, SQL, NoSQL), uses different technologies and concepts for communication (e.g. HTTP, FTP, RMI, JMS), and consists of different data formats using CSV, XML, binary data, or other alternatives. This session shows different open source frameworks and products to solve this challenging task. Learn how to use every thinkable data with Hadoop – without plenty of complex or redundant boilerplate code.

Published in: Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,615
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
64
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL your Data with Camel and Talend"

  1. 1. Kai Wähner| Talend @KaiWaehner www.kai-waehner.de/blog Xing / LinkedIn Big Data beyond Hadoop How to integrate ALL your Data
  2. 2. Kai Wähner Main Tasks Requirements Engineering Enterprise Architecture Management Business Process Management Architecture and Development of Applications Service-oriented Architecture Integration of Legacy Applications Cloud Computing Big Data Consulting Developing Coaching Speaking Writing © Talend 2013 Contact Email: kontakt@kai-waehner.de Blog: www.kai-waehner.de/blog Twitter: @KaiWaehner Social Networks: Xing, LinkedIn “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  3. 3. Key messages You have to care about big data to be competitive in the future! You have to integrate different sources to get most value out of it! Big data integration is no (longer) rocket science! © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  4. 4. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  5. 5. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  6. 6. Why should you care about big data? “If you can't measure it, you can't manage it.” William Edwards Deming (1900 –1993) American statistician, professor, author, lecturer and consultant © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  7. 7. Why should you care about big data? „Silence the HiPPOs“ (highest-paid person‘s opinion)  Being able to interpret unimaginable large data stream, the gut feeling is no longer justified!  © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  8. 8. What is big data? The Vs of big data Volume (terabytes, petabytes) Velocity (realtime or nearrealtime) Variety (social networks, blog posts, logs, sensors, etc.) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Value
  9. 9. Big data tasks to solve - before analysis Big Data Integration – Land data in a Big Data cluster – Implement or generate parallel processes Big Data Manipulation – Simplify manipulation, such as sort and filter – Computational expensive functions Big Data Quality & Governance – Identify linkages and duplicates, validate big data – Match connector, execute basic quality features Big Data Project Management – Place frameworks around big data projects – Common Repository, scheduling, monitoring © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  10. 10. Use case: Clickstream Analysis “The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want: ➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins. ➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.” http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  11. 11. Use case: Risk management Deduce Customer Defections http://hkotadia.com/archives/5021 © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  12. 12. Use case: Flexible pricing ➜ ➜ ➜ ➜ With revenue of almost USD 30 billion and a network of 800 locations, Macy's is considered the largest store operator in the USA Daily price check analysis of its 10,000 articles in less than two hours Whenever a neighboring competitor anywhere between New York and Los Angeles goes for aggressive price reductions, Macy's follows its example If there is no market competitor, the prices remain unchanged http://www.t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/1029702 © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  13. 13. Storage: Compliance Global Parcel Service A lot of data must be stored „forever“ ➜ Numbers increase exponentially ➜ Goal: As cheap as possible ➜ Problem: (Fast) queries must still be possible ➜ Solution: Commodity servers and „Hadoop querying“ ➜ http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  14. 14. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  15. 15. Limited big data experts This is your company Big Data Geek © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  16. 16. Big data tool selection (business perspective) Wanna buy a big data solution for your industry? ➜ Maybe a competitor has a big data solution which adds business value? ➜ The competitor will never publish it (rat-race)! ➜ © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  17. 17. Big data tool selection (technical perspective) Looking for ‚your‘ required big data product? Support your data from scratch? Good luck!  © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  18. 18. Data quality Big Data + Poor Data Quality = Big Problems © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  19. 19. How to solve these big data challenges? © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  20. 20. What is your Big Data process? Step 1 Step 2 Step 3 1) Do not begin with the data, think about business opportunities 2) Choose the right data (combine different data sources) 3) Use easy tooling http://hbr.org/2012/10/making-advanced-analytics-work-for-you © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  21. 21. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  22. 22. Technology perspective How to process big data? © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  23. 23. How to process big data? The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes. This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes. Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is always the slowest part of the process, becomes the weakest link in the performance chain. http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  24. 24. How to process big data? Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java Video: http://www.infoq.com/presentations/Big-Data-Hadoop-Java © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  25. 25. How to process big data? The defacto standard for big data processing © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  26. 26. How to process big data? “A big part of [the company’s strategy] includes wiring SQL Server 2012 (formerly known by the codename “Denali”) to the Hadoop distributed computing platform, and bringing Hadoop to Windows Server and Azure” Even Microsoft (the .NET house) relies on Hadoop since 2011 © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  27. 27. What is Hadoop? Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  28. 28. Map (Shuffle) Reduce Simple example • Input: (very large) text files with lists of strings, such as: „318, 0043012650999991949032412004...0500001N9+01111+99999999999...“ • We are interested just in some content: year and temperate (marked in red) • The Map Reduce function has to compute the maximum temperature for every year Example from the book “Hadoop: The Definitive Guide, 3rd Edition” © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  29. 29. How to process big data? © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  30. 30. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  31. 31. Alternatives for systems integration Integration Suite Enterprise Service Bus Integration Framework Low Connectivity Routing Transformation © Talend 2013 High + INTEGRATION Tooling Monitoring Support + BUSINESS PROCESS MGT. BIG DATA / MDM REGISTRY / REPOSITORY RULES ENGINE „YOU NAME IT“ “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Complexity of Integration
  32. 32. Alternatives for systems integration Integration Framework Enterprise Service Bus Low © Talend 2013 Integration Suite High “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Complexity of Integration
  33. 33. More details about integration frameworks... http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-frameworkspring-integration-apache-camel-vs-enterprise-service-bus-esb/ © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  34. 34. Enterprise Integration Patterns (EIP) Apache Camel Implements the EIPs © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  35. 35. Enterprise Integration Patterns (EIP) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  36. 36. Enterprise Integration Patterns (EIP) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  37. 37. Architecture http://java.dzone.com/articles/apache-camel-integration © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  38. 38. Choose your required connectors TCP SQL SMTP Netty RMI FTP Lucene JMX MQ Atom LDAP Quartz XSLT Akka Many many more © Talend 2013 IRC Log HTTP File EJB AMQP RSS AWS-S3 Jetty JDBC Bean-Validation JMS CXF “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Custom Components
  39. 39. Choose your favorite DSL XML (not production-ready yet) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  40. 40. Deploy it wherever you need Standalone Application Server Web Container Spring Container OSGi Cloud © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  41. 41. Enterprise-ready • Open Source • Scalability • Error Handling • Transaction • Monitoring • Tooling • Commercial Support © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  42. 42. Example: Camel integration route © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  43. 43. Hadoop Integration with Apache Camel © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  44. 44. camel-hdfs connector (Java DSL) // Producer from("ftp://user@myServer?password=secret") .to(“hdfs:///myDirectory/myFile.txt?append=true"); // Consumer from(“hdfs:///myDirectory/myBigDataAnalysis.csv") .to(“file:target/reports/report.csv"); © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  45. 45. camel-hbase connector (XML DSL) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  46. 46. Camel and Hadoop File Formats Choose the appropriate Hadoop file format, e.g. SequenceFile, MapFile, etc... © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  47. 47. camel-avro connector Camel is not just about connectors ... Camel supports a pluggable DataFormat to allow messages to be marshalled to and unmarshalled from binary or text formats, e.g. CSV, JSON, SOAP, EDI, ZIP, or ... ... Avro, a data serialization system used in Apache Hadoop. camel-avro connector provides a dataformat for Avro, which allows serialization and deserialization of messages using Apache Avro's binary data format. Moreover, it provides support for Apache Avro's RPC, by providing producers and consumers endpoint for using Avro over Netty or HTTP. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  48. 48. Real World Use Case: Storage: Compliance Global Parcel Service A lot of data must be stored „forever“ ➜ Numbers increase exponentially ➜ Goal: As cheap as possible ➜ Problem: (Fast) queries must still be possible ➜ Solution: Commodity servers and „Hadoop querying“ ➜ http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  49. 49. Real world use case: Storage: Compliance Orders (Server 1) Payments (Server 2) ETL Log Files (Server 3) Log Files (Server 100) © Talend 2013 Storage “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Query
  50. 50. Live demo Apache Camel in action... © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  51. 51. camel-pig? camel-hive? camel-hcatalog? Not available yet (current Camel version: 2.12)  Workarounds: • • • Use Pig / Hive-Query scripts (via camel-exec connector or any scripting language) Build your own connector (more details later ...) Use Hive-Hbase-Integration and store data in HBase ( „ugly“) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  52. 52. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  53. 53. Alternatives for systems integration Integration Suite Enterprise Service Bus Integration Framework Low Connectivity Routing Transformation © Talend 2013 High + INTEGRATION Tooling Monitoring Support + BUSINESS PROCESS MGT. BIG DATA / MDM REGISTRY / REPOSITORY RULES ENGINE „YOU NAME IT“ “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Complexity of Integration
  54. 54. Alternatives for systems integration Integration Framework Enterprise Service Bus Low © Talend 2013 Integration Suite High “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Complexity of Integration
  55. 55. More details about ESBs and suites... http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choicehow-to-choose-the-right-enterprise-service-bus-esb/ © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  56. 56. Hadoop Integration with Talend Open Studio © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  57. 57. Vision: Democratize big data Talend Open Studio for Big Data • • …an open source ecosystem © Talend 2013 Generates Hadoop code and run transforms inside Hadoop • Native support for HDFS, Pig, Hbase, Hcatalog, Sqoop and Hive • Pig Improves efficiency of big data job design with graphic interface 100% open source under an Apache License • Standards based “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  58. 58. Vision: Democratize big data Talend Platform for Big Data • Builds on Talend Open Studio for Big Data • Adds data quality, advanced scalability and management functions • • © Talend 2013 Data cleansing • • Data quality and profiling • …an open source ecosystem Shared Repository and remote deployment • Pig MapReduce massively parallel data processing Reporting and dashboards Commercial support, warranty/IP indemnity under a subscription license “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  59. 59. Talend Open Studio for Big Data © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  60. 60. Real world Use case: Clickstream Analysis “The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want: ➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins. ➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.” http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  61. 61. Example: A semi-structured log file © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  62. 62. Potential Uses of Clickstream Data One of the original uses of Hadoop at Yahoo was to store and process their massive volume of clickstream data. Now enterprises of all types can use Hadoop to refine and analyze clickstream data. They can then answer business questions such as: • • • What is the most efficient path for a site visitor to research a product, and then buy it? What products do visitors tend to buy together, and what are they most likely to buy in the future? Where should I spend resources on fixing or enhancing the user experience on my website? Goal: Data visualization can help you optimize your website and convert more visits into sales and revenue. Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  63. 63. Real world use case: Clickstream Analysis Log Files (Server 1) ETL Log Files (Server 2) Log Files (Server 100) © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  64. 64. Example: ETL Job „... using Talend’s HDFS and Hive Components” © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  65. 65. Example: ETL Job „... using Talend’s Map Reduce Components*” * Not available in open source version of Talend Studio © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  66. 66. Example: Analysis with Microsoft Excel We can see that the largest number of page hits in Florida were for clothing, followed by shoes. Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  67. 67. Example: Analysis with Microsoft Excel The chart shows that the majority of men shopping for clothing on our website are between the ages of 22 and 30. With this information, we can optimize our content for this market segment. Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  68. 68. Example: Analysis with Tableau Spoilt for Choice  Use your preferred BI or Analysis tool! © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  69. 69. Live demo „Talend Open Studio for Big Data“ in action... © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  70. 70. Agenda • Big data paradigm shift • Challenges for integrating big data • Choosing Hadoop for big data integration • Integration with an open source framework • Integration with an open source suite • Custom big data connectors © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  71. 71. Custom connectors Easy to realize for all integration alternatives * • Integration Framework • Enterprise Service Bus • Integration Suite * At least for open source solutions © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  72. 72. Custom connectors You might need a ... • ... Hive connector for Camel • ... Impala connector for Talend • ... custom connector for your internal data format © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  73. 73. Live demo (Example: Apache Camel) Custom connectors in action... © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  74. 74. Did you get the key message? © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  75. 75. Key messages You have to care about big data to be competitive in the future! You have to integrate different sources to get most value out of it! Big data integration is no (longer) rocket science! © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  76. 76. Did you get the key message? © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
  77. 77. Thank you for your attention. Questions? Kai Wähner| Talend @KaiWaehner www.kai-waehner.de/blog Xing / LinkedIn
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×