1. About Talend Corporation and Its Journey:
• Talend was founded in 2005, and its first product, Talend Open Studio for Data Integration, was launched in October 2006.
• Talend sponsors several open source technology foundations, including Apache and Eclipse.
• Talend currently employs engineers who work on Apache projects such as Apache Karaf, ActiveMQ, and Hadoop.
• Talend has an employee base of 600+ and around 1,300 enterprise customers across a range of industries.
What is Talend?
• Talend is an open source data integration studio based on the Eclipse IDE.
• Talend Studio is a dynamic code generator that produces Java, Perl, or MapReduce code for the corresponding job design.
• Jobs created in Talend Studio can be executed from within the studio or as standalone JAR files invoked by external programs (sketched below).
• Talend jobs can be easily embedded in custom applications, and custom components can be created in Talend based on an external application's JAR files.
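For example, a job exported from Talend Studio ("Build Job") ships as a runnable JAR with a generated main class that accepts context variables as --context_param arguments. A minimal sketch of launching such a job from a host Java application follows; the JAR name, class name, and context parameter are hypothetical placeholders, not names from any particular project.

import java.io.File;
import java.io.IOException;

// A minimal sketch of launching an exported Talend job from a host Java
// application. The JAR, class, and context parameter names below are
// hypothetical placeholders for whatever your exported job contains.
public class RunTalendJob {
    public static void main(String[] args) throws IOException, InterruptedException {
        String cp = "customer_load_0_1.jar" + File.pathSeparator + "lib/*"; // exported job + its libraries
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-cp", cp,
                "demo_project.customer_load_0_1.customer_load",  // generated job class (hypothetical)
                "--context_param", "input_dir=/data/in");        // context variables are passed this way
        pb.inheritIO();                                          // forward the job's console output
        int exitCode = pb.start().waitFor();
        System.out.println("Talend job exited with code " + exitCode);
    }
}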
Products under the Talend Platform:
Open Source Edition – Data Integration, Data Quality, ESB, MDM, and Big Data
Enterprise Edition – all open source products with additional features, plus Real-Time Big Data, Cloud Integration, Metadata Manager, and Data Fabric
2. Advantages of using Talend over competing integration tools:
1. Most competing integration tools are very expensive and not yet mature in the big data space.
2. Talend is an open source DI tool, so a project can be started without any budget for an ETL tool.
3. Talend has 900+ connectors, covering technologies such as Bonita BPM and the EXASOL in-memory database.
4. Leverage HDFS, Pig, Sqoop, HBase, MapReduce, and Hive for ETL without deep programming expertise (see the Hive sketch after this list).
5. Data quality and master data management capabilities can be extended to the big data platform as well.
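As an illustration of point 4, the kind of Hive access that a Talend component such as tHiveInput wraps for you can be approximated with plain Hive JDBC. A minimal sketch, assuming a reachable HiveServer2 endpoint and a hypothetical sales table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// A minimal sketch of a Hive query over JDBC. The host, port, and table
// name are placeholders; a Talend job generates equivalent calls for you.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");       // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://namenode-host:10000/default");     // hypothetical HiveServer2 endpoint
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, COUNT(*) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}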
Why is Talend gaining popularity in the current market?
• Horizontal resource scalability at runtime
• Layer of abstraction
• Breadth of functionality
• Ease of deployment and management
3. 1) Horizontal resource scalability with runtime servers:
Talend jobs can be deployed to AWS EC2 servers for execution using the Talend Container Service, and the lease on the EC2 instance can be terminated right after job execution completes.
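A minimal sketch of that release step, using the AWS SDK for Java (v1); the instance ID is a placeholder, and credentials and region are assumed to come from the default provider chain:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

// A minimal sketch of terminating an EC2 runtime server once a job run
// completes. The instance ID is a hypothetical placeholder.
public class ReleaseRuntimeServer {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds("i-0123456789abcdef0"));  // hypothetical runtime server instance
        System.out.println("Termination requested for runtime server.");
    }
}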
5. 2) Layer of abstraction:
With one click, the execution engine can be switched from MapReduce to Spark; it is just a configuration option change.
3) Breadth of functionality: In Talend, the same job designer is used for both the Data Integration and Big Data editions, whereas other tools require a different tool set for designing big data jobs.
4) Ease of deployment and management: Talend creates the Hadoop job and passes the job ID to the YARN ResourceManager, which takes care of job execution from there. No Talend-specific libraries need to be installed on the Hadoop cluster, whereas other tools require their own big data libraries to be installed on every node of the cluster.
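A minimal sketch of following such a job through the ResourceManager with Hadoop's YarnClient API; the application ID below is a placeholder for the one returned at submission:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// A minimal sketch of polling the YARN ResourceManager for the state of a
// submitted job. The application ID is a hypothetical placeholder.
public class TrackYarnJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up yarn-site.xml from the classpath
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        ApplicationId appId =
                ApplicationId.fromString("application_1500000000000_0001"); // hypothetical (Hadoop 2.8+ API)
        ApplicationReport report = yarn.getApplicationReport(appId);
        System.out.println(report.getName() + " state: " + report.getYarnApplicationState());
        yarn.stop();
    }
}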
6. Talend for Big Data installation system requirements:
1. Memory: 4 GB RAM
2. Disk space: 3 GB
3. Recommended OS: Microsoft Windows 7 Professional, Ubuntu Linux
4. Supported OS: Apple OS X
5. It also works perfectly on my personal laptop with Windows 10 Home Edition and CentOS 7.
6. Software: Oracle Java 8 JRE
7. Network connectivity to a properly installed and configured Hadoop cluster.
Prerequisites to learn Talend for Big Data:
Fundamentals of computers, SQL, Linux commands, and conditional statements (if...then...else).
Is Java programming mandatory for Talend DI job design?
No, it is not mandatory. Occasionally, when a business requirement goes beyond the existing tool functionality, you may have to write code routines to fulfill custom requirements (a sample routine is sketched below).
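A Talend routine is simply a Java class with public static methods (created under Code > Routines in the repository) that components such as tMap can call in their expressions. A minimal, hypothetical example:

// A minimal sketch of a custom Talend user routine. The class and method
// are hypothetical examples, not built-in routines.
public class MyStringUtils {

    /**
     * Normalizes a customer code: trims whitespace and upper-cases it.
     * In a tMap expression this would be called as
     * MyStringUtils.normalizeCode(row1.customer_code).
     */
    public static String normalizeCode(String code) {
        if (code == null) {
            return null;                    // keep nulls flowing through unchanged
        }
        return code.trim().toUpperCase();
    }
}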
Talend tool GUI basics:
What is a workspace?
What is a project? How do you create, delete, or import a demo or existing project?
What types of repository connections can be used to connect to Talend Studio?
7. GUI tools and Features:
Main window, Tool bar, Repository Tree view
Designer workspace, Palette, Configuration Tabs
Outline, Code Viewer
Window > Show View to bring other configuration tabs into the main window
Metadata:
Centralized connections – file, database, Hadoop
What is a connection, what types of connections exist, and why are connection links needed?
Row connections, trigger connections
Sample job design and execution in Talend Studio
8. • Agenda: topics covered in this demo
• What is big data, and the characteristics of a big data platform
• Why many customers are adopting the Hadoop stack
• Physical components of a Hadoop cluster and its architecture
• Hadoop ecosystem components and the use of each component
• Challenges in implementing big data projects using conventional Hadoop
• Positives and negatives of using Talend DI for big data compared to conventional Hadoop ecosystem components
• Talend client-server architecture
• MapReduce job use case with hand coding and a Talend DI job
9. What is Big Data?
Big data is a data hosting platform for data sets so large or complex that traditional data processing applications are insufficient to deal with them.
• Big Data = distributed computing + fault tolerance
• Distributed computing – the concept of shared-nothing storage plus parallel processing
• Fault tolerance – the ability of a system to continue functioning properly in the event of a partial failure
A brief explanation of distributed computing and fault tolerance, using how a gigantic file's blocks are stored and accessed in a client-server architecture versus a distributed computing architecture (see the sketch below).
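The distributed side of that comparison can be made concrete with Hadoop's FileSystem API, which reports where each block of a large file lives. A minimal sketch, assuming a configured cluster and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A minimal sketch showing how a gigantic file is split into blocks spread
// across data nodes. The file path is a hypothetical placeholder.
public class ShowBlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/gigantic_file.csv")); // hypothetical
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // each block is replicated on several nodes, which is what gives fault tolerance
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}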
10. Some example big data platforms:
• Hadoop – Apache, Hortonworks, Cloudera, MapR
• Teradata Aster nCluster
• Pivotal Big Data Suite
• Amazon Redshift
• Azure SQL Data Warehouse
• Microsoft Azure HDInsight
Hadoop history and the role of Google in the Apache Hadoop project:
Google's problem statement, and how they overcame it using GFS + MapReduce.
Later, Doug Cutting created the Hadoop framework with reference to Google's published white papers.
How is Hadoop different from traditional technologies?
• Hadoop has the big data characteristics of distributed computing and fault tolerance.
• Due to inexpensive storage costs, one can build a data lake on the HDFS layer.
What are the advantages of using Hadoop, from a cost and architectural feasibility perspective?
• Horizontal resource scalability
• Processing of large and/or rapidly growing data sets, whether structured or unstructured
• Affordable commodity hardware
• Open source
• Computation moves toward the data rather than data being transferred toward the computation
11. High-level Hadoop cluster architecture and physical core components:
Multiple nodes form a rack, multiple racks form a data center, and nodes and racks from different data centers in various geographic locations can all be part of a Hadoop cluster.
12. Hadoop ecosystem components:
The two main components of Apache Hadoop are HDFS for storage and MapReduce for data processing.
• Flume
• Sqoop
• Zookeeper
• Oozie
• Pig
• Hive
• HBase
• Solr
13. What are the challenges in implementing a big data project with the conventional Hadoop framework?
• Applying agile methodology to big data projects is a nightmare.
• Finding the right resources with MapReduce and Scala/Spark skills in the market is a big challenge.
• Addressing/justifying the business case: rather than trying to convert existing reports that have been in use for the past decade to run on big data, you may have to make the business understand that the point of adding big data is to gain predictive analytics features and compete with business rivals.
Pros and cons of using Talend BD DI compared to conventional Hadoop ecosystem components:
Pros:
• Graphical development of big data and Hadoop jobs
• Leverage existing technical resources with a bare minimum training investment
• Speed up big data projects with agile methodology implementations
• Seamless, tight integration is possible with related subject areas like data quality, event-based job scheduling, and master data management
• Runtime server execution
Constraints:
• Version compatibility dependencies between Talend and the Hadoop distribution
14. Talend architecture and its components:
Nexus repository
Metadata repository
Talend Administration Center
Admin/audit/monitoring
Execution servers
15. Lab practical: joining two HDFS files and aggregating the data using a Talend job with the MapReduce engine.
Lab practical: joining two HDFS files and aggregating the data using a conventional hand-coded Hadoop MapReduce job (a sketch follows).
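For reference, a minimal hand-coded sketch of the second lab: a classic reduce-side join of two HDFS files followed by an aggregation. The file layouts are assumptions (customers as "customer_id,region", orders as "customer_id,amount"); it joins on customer_id and emits each customer's region with that customer's summed order amount.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal reduce-side join sketch under the assumed file layouts above.
public class JoinAndAggregate {

    // Tags each customer record with "C|region", keyed by customer_id.
    public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            ctx.write(new Text(f[0]), new Text("C|" + f[1]));
        }
    }

    // Tags each order record with "O|amount", keyed by customer_id.
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            ctx.write(new Text(f[0]), new Text("O|" + f[1]));
        }
    }

    // Joins the tagged records per customer, then emits (region, order total).
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String region = null;
            double total = 0;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("C|")) {
                    region = s.substring(2);         // customer side of the join
                } else {
                    total += Double.parseDouble(s.substring(2)); // order side: sum amounts
                }
            }
            if (region != null) {                    // inner join: drop orders with no customer
                ctx.write(new Text(region), new Text(String.valueOf(total)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "join-and-aggregate");
        job.setJarByClass(JoinAndAggregate.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustomerMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}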