Your SlideShare is downloading. ×
Big data beyond Hadoop –
How to integrate ALL your data
Kai Wähner
kwaehner@talend.com
@KaiWaehner
www.kai-waehner.de
9/24...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Consulting
Developing
Coaching
Speak...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Key messages
You have to care about ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
William Edwards Deming
(1900 –1993)
...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
 „Silence the HiPPOs“ (highest-paid...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
What is big data? The Vs of big data...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Big Data Integration
– Land data in ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
“The advantage of their new system i...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
http://hkotadia.com/archives/5021
De...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
➜ With revenue of almost USD 30 bill...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
➜ A lot of data must be stored „fore...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
This is your
company
Big Data Geek
L...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Big Data + Poor Data Quality = Big P...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
➜ Wanna buy a big data solution for ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Looking for ‚your‘ required big data...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
How to solve these big data challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
 “*Often+ simple models and
big dat...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
 Look at use cases of others
(SMU, ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
1) Do not begin with the data, think...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Technology perspective
How to proces...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
The critical flaw in parallel ETL to...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Slides: http://www.slideshare.net/pa...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
The defacto standard for big data pr...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Even Microsoft (the .NET house) reli...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Apache Hadoop, an open-source softwa...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Simple example
• Input: (very large)...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
How to process big data?
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Connectivity
Routing
Transformation
...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Complexity
of Integration
Enterprise...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
More details about integration frame...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
More details about integration frame...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Enterprise Integration Patterns (EIP...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Enterprise Integration Patterns (EIP)
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Enterprise Integration Patterns (EIP)
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Architecture
http://java.dzone.com/a...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
HTTP
FTP
File
XSLT
MQ
JDBC
Akka
TCP
...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Choose your favorite DSL
XML
(not pr...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Deploy it wherever you need
Standalo...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Enterprise-ready
• Open Source
• Sca...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: Camel integration route
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Hadoop Integration with Apache Camel
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
camel-hdfs component
// Producer
fro...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
camel-hbase component
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
➜ A lot of data must be stored „fore...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Real world use case: Storage: Compli...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Live demo
Apache Camel in action...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
camel-pig? camel-hive? camel-hcatalo...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
camel-avro component
... Avro, a dat...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Connectivity
Routing
Transformation
...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Complexity
of Integration
Enterprise...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
More details about ESBs and suites.....
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Hadoop Integration with Talend Open ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
…an open source
ecosystem
Talend Ope...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
…an open source
ecosystem
Talend Pla...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Talend Open Studio for Big Data
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
“The advantage of their new system i...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Real world use case: Clickstream Ana...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
One of the original uses of Hadoop a...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: A semi-structured log file
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: ETL Job
„... using Talend’s...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: ETL Job
„... using Talend’s...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
„Talend Open Studio for Big Data“ in...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: Analysis with Microsoft Exc...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: Analysis with Microsoft Exc...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Example: Analysis with Tableau
Spoil...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
• Big data paradigm shift
• Challeng...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Custom components
Easy to realize fo...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Custom components
You might need a ....
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Live demo (Example: Apache Camel)
Cu...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Alternative for custom components
• ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Code example: REST API for Salesforc...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Did you get the key message?
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Key messages
You have to care about ...
© Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner
Did you get the key message?
Thank you for your attention. Questions?
kwaehner@talend.com
www.kai-waehner.de
LinkedIn / Xing
@KaiWaehner
Upcoming SlideShare
Loading in...5
×

"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013

1,679

Published on

Big data represents a significant paradigm shift in enterprise technology. Big data radically changes the nature of the data management profession as it introduces new concerns about the volume, velocity and variety of corporate data.

Apache Hadoop is the open source defacto standard for implementing big data solutions on the Java platform. Hadoop consists of its kernel, MapReduce, and the Hadoop Distributed Filesystem (HDFS). A challenging task is to send all data to Hadoop for processing and storage (and then get it back to your application later), because in practice data comes from many different applications (SAP, Salesforce, Siebel, etc.) and databases (File, SQL, NoSQL), uses different technologies and concepts for communication (e.g. HTTP, FTP, RMI, JMS), and consists of different data formats using CSV, XML, binary data, or other alternatives.

This session shows different open source frameworks and products to solve this challenging task. Learn how to use every thinkable data with Hadoop – without plenty of complex or redundant boilerplate code.

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,679
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
0
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of ""Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"

  1. 1. Big data beyond Hadoop – How to integrate ALL your data Kai Wähner kwaehner@talend.com @KaiWaehner www.kai-waehner.de 9/24/2013
  2. 2. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Consulting Developing Coaching Speaking Writing Main Tasks Requirements Engineering Enterprise Architecture Management Business Process Management Architecture and Development of Applications Service-oriented Architecture Integration of Legacy Applications Cloud Computing Big Data Contact Email: kontakt@kai-waehner.de Blog: www.kai-waehner.de/blog Twitter: @KaiWaehner Social Networks: Xing, LinkedIn Kai Wähner
  3. 3. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Key messages You have to care about big data to be competitive in the future! You have to integrate different sources to get most value out of it! Big data integration is no (longer) rocket science!
  4. 4. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  5. 5. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  6. 6. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner William Edwards Deming (1900 –1993) American statistician, professor, author, lecturer and consultant “If you can't measure it, you can't manage it.” Why should you care about big data?
  7. 7. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner  „Silence the HiPPOs“ (highest-paid person‘s opinion)  Being able to interpret unimaginable large data stream, the gut feeling is no longer justified! Why should you care about big data?
  8. 8. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner What is big data? The Vs of big data Volume (terabytes, petabytes) Variety (social networks, blog posts, logs, sensors, etc.) Velocity (realtime or near- realtime) Value
  9. 9. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Big Data Integration – Land data in a Big Data cluster – Implement or generate parallel processes Big Data Manipulation – Simplify manipulation, such as sort and filter – Computational expensive functions Big Data Quality & Governance – Identify linkages and duplicates, validate big data – Match component, execute basic quality features Big Data Project Management – Place frameworks around big data projects – Common Repository, scheduling, monitoring Big data tasks to solve - before analysis
  10. 10. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner “The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want: ➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins. ➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.” http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data Use case: Clickstream Analysis
  11. 11. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner http://hkotadia.com/archives/5021 Deduce Customer Defections Use case: Risk management
  12. 12. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner ➜ With revenue of almost USD 30 billion and a network of 800 locations, Macy's is considered the largest store operator in the USA ➜ Daily price check analysis of its 10,000 articles in less than two hours ➜ Whenever a neighboring competitor anywhere between New York and Los Angeles goes for aggressive price reductions, Macy's follows its example ➜ If there is no market competitor, the prices remain unchanged http://www.t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/1029702 Use case: Flexible pricing
  13. 13. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner ➜ A lot of data must be stored „forever“ ➜ Numbers increase exponentially ➜ Goal: As cheap as possible ➜ Problem: (Fast) queries must still be possible ➜ Solution: Commodity servers and „Hadoop querying“ Global Parcel Service http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up Storage: Compliance
  14. 14. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  15. 15. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner This is your company Big Data Geek Limited big data experts
  16. 16. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Big Data + Poor Data Quality = Big Problems Data quality
  17. 17. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner ➜ Wanna buy a big data solution for your industry? ➜ Maybe a competitor has a big data solution which adds business value? ➜ The competitor will never publish it (rat-race)! Big data tool selection (business perspective)
  18. 18. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Looking for ‚your‘ required big data product? Support your data from scratch? Good luck!  Big data tool selection (technical perspective)
  19. 19. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner How to solve these big data challenges?
  20. 20. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner  “*Often+ simple models and big data trump more-elaborate [and complex] analytics approaches”  “Often someone coming from outside an industry can spot a better way to use big data than an insider” Erik Brynjolfsson / Lynn Wu http://alfredopassos.tumblr.com/post/32461599327/big-data-the-management-revolution-by-andrew-mcafee Be no expert! Be simple!
  21. 21. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner  Look at use cases of others (SMU, but also large companies)  How can you do something similar with your data?  You have different data sources? Use it! Combine it! Play with it! Be creative!
  22. 22. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner 1) Do not begin with the data, think about business opportunities 2) Choose the right data (combine different data sources) 3) Use easy tooling http://hbr.org/2012/10/making-advanced-analytics-work-for-you What is your Big Data process? Step 1 Step 2 Step 3
  23. 23. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  24. 24. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Technology perspective How to process big data?
  25. 25. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes. This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes. Worse, if the partition key of the source doesn’t match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is always the slowest part of the process, becomes the weakest link in the performance chain. http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead How to process big data?
  26. 26. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Slides: http://www.slideshare.net/pavlobaron/100-big-data-0-hadoop-0-java Video: http://www.infoq.com/presentations/Big-Data-Hadoop-Java How to process big data?
  27. 27. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner The defacto standard for big data processing How to process big data?
  28. 28. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Even Microsoft (the .NET house) relies on Hadoop since 2011 How to process big data? “A big part of [the company’s strategy+ includes wiring SQL Server 2012 (formerly known by the codename “Denali”) to the Hadoop distributed computing platform, and bringing Hadoop to Windows Server and Azure”
  29. 29. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. What is Hadoop?
  30. 30. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Simple example • Input: (very large) text files with lists of strings, such as: „318, 0043012650999991949032412004...0500001N9+01111+99999999999...“ • We are interested just in some content: year and temperate (marked in red) • The Map Reduce function has to compute the maximum temperature for every year Example from the book “Hadoop: The Definitive Guide, 3rd Edition” Map (Shuffle) Reduce
  31. 31. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner How to process big data?
  32. 32. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  33. 33. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Connectivity Routing Transformation Complexity of Integration Enterprise Service Bus Integration Suite Low High Integration Framework INTEGRATION Tooling Monitoring Support + BUSINESS PROCESS MGT. BIG DATA / MDM REGISTRY / REPOSITORY RULES ENGINE „YOU NAME IT“ + Alternatives for systems integration
  34. 34. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Complexity of Integration Enterprise Service Bus Integration Suite Low High Integration Framework Alternatives for systems integration
  35. 35. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner More details about integration frameworks... http://www.kai-waehner.de/blog/2012/12/20/showdown-integration-framework- spring-integration-apache-camel-vs-enterprise-service-bus-esb/
  36. 36. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner More details about integration frameworks... ... or you come to my JavaOne session tomorrow and see an updated version of the slides! Wednesday, 11:30 – 12:30 PM CON1934: Which Integration Framework to Choose?
  37. 37. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Enterprise Integration Patterns (EIP) Apache Camel Implements the EIPs
  38. 38. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Enterprise Integration Patterns (EIP)
  39. 39. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Enterprise Integration Patterns (EIP)
  40. 40. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Architecture http://java.dzone.com/articles/apache-camel-integration
  41. 41. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner HTTP FTP File XSLT MQ JDBC Akka TCP SMTP RSS Quartz Log LDAP JMS EJB AMQP Atom AWS-S3 Bean-Validation CXF IRC Jetty JMX Lucene Netty RMI SQL Many many more Custom Components Choose your required components
  42. 42. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Choose your favorite DSL XML (not production-ready yet)
  43. 43. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Deploy it wherever you need Standalone OSGi Application Server Web Container Spring Container Cloud
  44. 44. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Enterprise-ready • Open Source • Scalability • Error Handling • Transaction • Monitoring • Tooling • Commercial Support
  45. 45. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: Camel integration route
  46. 46. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Hadoop Integration with Apache Camel
  47. 47. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner camel-hdfs component // Producer from("ftp://user@myServer?password=secret") .to(“hdfs:///myDirectory/myFile.txt?append=true"); // Consumer from(“hdfs:///myDirectory/myBigDataAnalysis.csv") .to(“file:target/reports/report.csv");
  48. 48. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner camel-hbase component
  49. 49. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner ➜ A lot of data must be stored „forever“ ➜ Numbers increase exponentially ➜ Goal: As cheap as possible ➜ Problem: (Fast) queries must still be possible ➜ Solution: Commodity servers and „Hadoop querying“ Global Parcel Service http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up Real World Use Case: Storage: Compliance
  50. 50. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Real world use case: Storage: Compliance Orders (Server 1) Log Files (Server 3) Log Files (Server 100) ETL QueryStorage Payments (Server 2)
  51. 51. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Live demo Apache Camel in action...
  52. 52. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner camel-pig? camel-hive? camel-hcatalog? Not available yet (current Camel version: 2.12)  Workarounds: • Use Pig / Hive-Query scripts (via camel-exec component or any scripting language) • Build your own component (more details later ...) • Use Hive-Hbase-Integration and store data in HBase ( „ugly“)
  53. 53. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner camel-avro component ... Avro, a data serialization system used in Apache Hadoop. camel-avro component provides a dataformat for Avro, which allows serialization and deserialization of messages using Apache Avro's binary data format. Moreover, it provides support for Apache Avro's RPC, by providing producers and consumers endpoint for using Avro over Netty or HTTP. Camel is not just about connectors ... Camel supports a pluggable DataFormat to allow messages to be marshalled to and unmarshalled from binary or text formats, e.g. CSV, JSON, SOAP, EDI, ZIP, or ...
  54. 54. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  55. 55. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Connectivity Routing Transformation Complexity of Integration Enterprise Service Bus Integration Suite Low High Integration Framework INTEGRATION Tooling Monitoring Support + BUSINESS PROCESS MGT. BIG DATA / MDM REGISTRY / REPOSITORY RULES ENGINE „YOU NAME IT“ + Alternatives for systems integration
  56. 56. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Complexity of Integration Enterprise Service Bus Integration Suite Low High Integration Framework Alternatives for systems integration
  57. 57. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner More details about ESBs and suites... http://www.kai-waehner.de/blog/2013/01/23/spoilt-for-choice- how-to-choose-the-right-enterprise-service-bus-esb/
  58. 58. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Hadoop Integration with Talend Open Studio
  59. 59. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner …an open source ecosystem Talend Open Studio for Big Data • Improves efficiency of big data job design with graphic interface • Generates Hadoop code and run transforms inside Hadoop • Native support for HDFS, Pig, Hbase, Hcatalog, Sqoop and Hive • 100% open source under an Apache License • Standards based Pig Vision: Democratize big data
  60. 60. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner …an open source ecosystem Talend Platform for Big Data • Builds on Talend Open Studio for Big Data • Adds data quality, advanced scalability and management functions • MapReduce massively parallel data processing • Shared Repository and remote deployment • Data quality and profiling • Data cleansing • Reporting and dashboards • Commercial support, warranty/IP indemnity under a subscription license Pig Vision: Democratize big data
  61. 61. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Talend Open Studio for Big Data
  62. 62. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner “The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want: ➜ Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins. ➜ When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.” http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data Real world Use case: Clickstream Analysis
  63. 63. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Real world use case: Clickstream Analysis Log Files (Server 1) Log Files (Server 2) Log Files (Server 100) ETL
  64. 64. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner One of the original uses of Hadoop at Yahoo was to store and process their massive volume of clickstream data. Now enterprises of all types can use Hadoop to refine and analyze clickstream data. They can then answer business questions such as: • What is the most efficient path for a site visitor to research a product, and then buy it? • What products do visitors tend to buy together, and what are they most likely to buy in the future? • Where should I spend resources on fixing or enhancing the user experience on my website? Goal: Data visualization can help you optimize your website and convert more visits into sales and revenue. Potential Uses of Clickstream Data Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
  65. 65. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: A semi-structured log file
  66. 66. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: ETL Job „... using Talend’s HDFS and Hive Components”
  67. 67. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: ETL Job „... using Talend’s Map Reduce Components*” * Not available in open source version of Talend Studio
  68. 68. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner „Talend Open Studio for Big Data“ in action... Live demo
  69. 69. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: Analysis with Microsoft Excel We can see that the largest number of page hits in Florida were for clothing, followed by shoes. Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
  70. 70. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: Analysis with Microsoft Excel The chart shows that the majority of men shopping for clothing on our website are between the ages of 22 and 30. With this information, we can optimize our content for this market segment. Source: for Clickstream Example: „Hortonworks Hadoop Tutorials - Real Life Use Cases” http://hortonworks.com/blog/hadoop-tutorials-real-life-use-cases-in-the-sandbox
  71. 71. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Example: Analysis with Tableau Spoilt for Choice  Use your preferred BI or Analysis tool!
  72. 72. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner • Big data paradigm shift • Challenges of big data • Big data from a technology perspective • Integration with an open source framework • Integration with an open source suite • Custom big data components Agenda
  73. 73. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Custom components Easy to realize for all integration alternatives * • Integration Framework • Enterprise Service Bus • Integration Suite * At least for open source solutions
  74. 74. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Custom components You might need a ... • ... Hive component for Camel • ... Impala component for Talend • ... custom component for your internal data format
  75. 75. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Live demo (Example: Apache Camel) Custom components in action...
  76. 76. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Alternative for custom components • SOAP • REST
  77. 77. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Code example: REST API for Salesforce object store // Salesforce Query (SOQL) via REST API from("direct:salesforceViaHttpLIST") .setHeader("X-PrettyPrint", 1) .setHeader("Authorization", accessToken) .setHeader(Exchange.CONTENT_TYPE, "application/json") .to("https://na14.salesforce.com/services/data/v20.0/query?q=SELECT+name+from +Article__c") // Salesforce CREATE via REST API from("direct:salesforceViaHttpCREATE") .setHeader("X-PrettyPrint", 1) .setHeader("Authorization", accessToken) .setHeader(Exchange.CONTENT_TYPE, "application/json“) .to("https://na14.salesforce.com/services/data/v20.0/sobjects/Article__c")
  78. 78. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Did you get the key message?
  79. 79. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Key messages You have to care about big data to be competitive in the future! You have to integrate different sources to get most value out of it! Big data integration is no (longer) rocket science!
  80. 80. © Talend 2013 “Big Data beyond Hadoop – How to integrate ALL your Data” by Kai Wähner Did you get the key message?
  81. 81. Thank you for your attention. Questions? kwaehner@talend.com www.kai-waehner.de LinkedIn / Xing @KaiWaehner

×