Talend Open Studio Data Integration


  Talend Data Integration and Management
  Data Integration
  Data Integration involves combining data residing in different sources and providing the user with a unified view of the data.
  Data Management combines different disciplines to manage data as a valuable resource.
  Talend
  ● Talend is a company focused on Data Integration and Data Management solutions
  ● Named a "Cool Vendor" by Gartner (2010)
  ● Present in more than 12 locations around the world
  ● Fast-growing company
  Talend Open Studio
  Talend Open Studio
  ● Open source, professional tool
  ● Draw procedures by linking components; each component performs one operation
  ● Optimized components specific to each DB vendor
  ● Produces fully editable Java (or Perl) code
  ● Deployment as small, fast compiled Java or as a Web Service
  ● Eclipse-based IDE, excellent flexibility
  ● BI platform independent, DB vendor independent
  Automatic code generation, different deployment options
  Extraction Transformation Loading
  ● ETL is a common process in Data Integration
    ● Extract: read data from different data sources (databases, flat files, spreadsheets, web services, etc.)
    ● Transform: convert the data into a form that can be placed in another container (databases, web services, files, etc.). Cleaning, computations and verifications are also performed at this stage
    ● Load: write the data in the target format
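The three ETL steps can be sketched in plain Java. This is only an illustration of the process, not the code Talend Open Studio generates: the class `EtlSketch`, the `User` type, the semicolon-separated input and the rule that a negative age is invalid are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EtlSketch {
    // Hypothetical target row, e.g. one entry of a users dimension.
    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    // Extract: an in-memory list stands in for a flat file with a header row.
    static List<String> extract() {
        return Arrays.asList("name;age", " Alice ;42", "Bob;-1", "Carol;37");
    }

    // Transform: skip the header, clean the fields, verify them, convert the types.
    static List<User> transform(List<String> lines) {
        List<User> out = new ArrayList<>();
        if (lines.isEmpty()) return out;
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(";");
            if (fields.length != 2) continue;          // malformed row
            String name = fields[0].trim();            // cleaning
            int age;
            try {
                age = Integer.parseInt(fields[1].trim());
            } catch (NumberFormatException e) {
                continue;                              // not a number
            }
            if (name.isEmpty() || age < 0) continue;   // verification (assumed rule)
            out.add(new User(name, age));
        }
        return out;
    }

    // Load: write rows in the target format (comma-separated lines here).
    static List<String> load(List<User> users) {
        List<String> target = new ArrayList<>();
        for (User u : users) target.add(u.name + "," + u.age);
        return target;
    }

    public static void main(String[] args) {
        // Prints "Alice,42" and "Carol,37"; the "Bob;-1" row is rejected.
        for (String row : load(transform(extract()))) System.out.println(row);
    }
}
```

In a Talend job each of these methods would roughly correspond to one or more linked components rather than hand-written code.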
  Tutorial, Source data
  Tutorial, Destination data (Datawarehouse)
  Tutorial, Metadata
  ● Talend requires a preliminary definition of the metadata
  ● As in programming languages, a strong metadata definition often leads to fast, robust and maintainable applications
  ● ..demo..
  Tutorial, Talend jobs basics
  ● Place components on the designer
  ● Link components to build a transformation
  ● Main type of link: row flow
  ● Schema metadata is propagated and must be coherent
  ● ..demo..
  Tutorial, users_dimension
  Test the job
  Tutorial, accounts_dimension
  Tutorial, dates_dimension
  Tutorial, write a Java library
  Tutorial, opportunities_fact
  Tutorial, define a root job
  Deploy and run
  Extensibility, community plugins
  ● Many official components
  ● Components for every task released by the community
  ● Geospatial components, log analysis, Google Analytics, data encryption, etc.
  Scheduler
  And now.. reports, dashboards, OLAP, Geoanalysis, KPIs..
  Do you trust your data?
  What about data quality?
  ● Customer A is present 5 times with different names
  ● Null values can distort statistical indicators such as the mean
  ● Duplicated records
  ● Blank values
  ● Some records can contain errors (e.g. fields with a value of -1)
  ● Some records can be garbage
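A few of these problems can be made concrete with a small Java sketch. It is not how Talend Open Profiler works internally; the class `QualityChecks` and its method names are hypothetical, and the duplicate check only normalizes case and surrounding whitespace, so it would miss a customer entered under genuinely different names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QualityChecks {
    // Mean over the non-null values only.
    static double meanSkippingNulls(List<Integer> values) {
        long sum = 0;
        int n = 0;
        for (Integer v : values) {
            if (v != null) { sum += v; n++; }
        }
        return n == 0 ? Double.NaN : (double) sum / n;
    }

    // Mean when nulls are silently treated as 0: the result is pulled down.
    static double meanNullsAsZero(List<Integer> values) {
        if (values.isEmpty()) return Double.NaN;
        long sum = 0;
        for (Integer v : values) {
            if (v != null) sum += v;
        }
        return (double) sum / values.size();
    }

    // Counts values that are null or contain only whitespace.
    static int countBlank(List<String> values) {
        int blanks = 0;
        for (String v : values) {
            if (v == null || v.trim().isEmpty()) blanks++;
        }
        return blanks;
    }

    // Counts repeats of an already seen value after trimming and lower-casing.
    static int countDuplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String v : values) {
            if (v == null || v.trim().isEmpty()) continue;
            String key = v.trim().toLowerCase(Locale.ROOT);
            if (!seen.add(key)) duplicates++;
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Integer> amounts = Arrays.asList(10, null, 20, null);
        System.out.println(meanSkippingNulls(amounts)); // 15.0
        System.out.println(meanNullsAsZero(amounts));   // 7.5

        List<String> names = Arrays.asList("ACME Corp", " acme corp ", "Acme Corp", "", null, "Globex");
        System.out.println(countBlank(names));          // 2
        System.out.println(countDuplicates(names));     // 2
    }
}
```

The two mean functions return different results for the same column, which is the point of the bullet on null values: the treatment of nulls has to be an explicit decision.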
  Talend Open Profiler
  What about data storage size?
  ● Some fields can be oversized for the data they contain
  ● Sometimes fields are related and can be calculated from one another
  ● Some keys or values are never used
  ● As data grows, garbage grows too
  ● Data storage is not free (disks, electricity, backups, DB licenses)
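The first point, oversized fields, can be checked with a very small sketch. The class `FieldSizeCheck`, the declared width of 255 and the sample country codes are assumptions for illustration; a real profiler would read the declared width from the database metadata.

```java
import java.util.Arrays;
import java.util.List;

class FieldSizeCheck {
    // Length of the longest value actually stored; nulls count as length 0.
    static int maxUsedLength(List<String> values) {
        int max = 0;
        for (String v : values) {
            if (v != null) max = Math.max(max, v.length());
        }
        return max;
    }

    // Share of the declared width that no stored value ever uses (0.0 to 1.0).
    static double unusedFraction(int declaredWidth, List<String> values) {
        if (declaredWidth <= 0) {
            throw new IllegalArgumentException("declaredWidth must be positive");
        }
        int used = Math.min(maxUsedLength(values), declaredWidth);
        return 1.0 - (double) used / declaredWidth;
    }

    public static void main(String[] args) {
        // Hypothetical column declared as 255 characters but holding 2-letter codes.
        List<String> countryCodes = Arrays.asList("IT", "FR", "DE", null);
        System.out.println("longest value: " + maxUsedLength(countryCodes));
        System.out.println("unused share:  " + unusedFraction(255, countryCodes));
    }
}
```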
  Data is "black gold" that can produce knowledge
  ● Data is a resource from which you can extract knowledge
  ● A large amount of data can be distilled into concise information
  ● Data storage is not free, and a large amount of data can slow the system down
  ● Data cleansing is a central process in statistical analysis and Data Mining
  Talend Master Data Management