Data Integration
Data Integration involves combining data residing in different sources and providing the user with a unified view of the data.
Data Management combines different disciplines to manage data as a valuable resource.
www.robertomarchetto.com
Talend
● Talend is a company focused on Data Integration and Data Management solutions
● Named a "Cool Vendor" by Gartner (2010)
● Present in more than 12 locations around the world
● Fast-growing company
Talend Open Studio
● Open source, professional tool
● Draw procedures by linking components; each component performs one operation
● Optimized, DB vendor-specific components
● Produces fully editable Java (or Perl) code
● Deployment as small, fast compiled Java or as a Web Service
● Eclipse-based IDE, excellent flexibility
● BI platform independent, DB vendor independent
Extraction, Transformation, Loading (ETL)
● ETL is a common process in Data Integration
  ● Extract: read data from different data sources (databases, flat files, spreadsheets, web services, etc.)
  ● Transform: convert the data into a form that can be placed in another container (database, web services, files, etc.). Cleaning, computations and verifications are also performed
  ● Load: write the data in the target format
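The three ETL steps can be sketched in a few lines of plain Python. This is only an illustration, not code generated by Talend: the CSV content, the `customers` table and the function names `extract`, `transform` and `load` are assumptions made up for the example, and an in-memory SQLite database stands in for the target.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a flat file with untidy values.
SOURCE_CSV = """id,name,amount
1, Alice ,10.50
2,Bob,
3,carol,7
"""

def extract(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean names, convert types, turn blank amounts into None."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        cleaned.append((
            int(row["id"]),
            row["name"].strip().title(),
            float(amount) if amount else None,
        ))
    return cleaned

def load(rows, connection):
    """Load: write the transformed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    connection.commit()

# Assumed target: an in-memory SQLite database.
connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
loaded = connection.execute(
    "SELECT id, name, amount FROM customers ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the way an ETL tool separates input, processing and output components, so each step can be replaced without touching the others.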
Tutorial, Destination data (Data Warehouse)
Tutorial, Metadata
● Talend requires a preliminary definition of the metadata
● As in programming languages, a strong metadata definition often leads to fast, robust and maintainable applications
● ..demo..
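What a preliminary metadata definition buys you can be sketched in Python as a schema that every row is checked against before it is processed. The `Column` class, the `CUSTOMER_SCHEMA` list and the `validate` helper are hypothetical names for this illustration; they are not Talend's metadata format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    """One column of the (hypothetical) metadata definition."""
    name: str
    py_type: type
    nullable: bool = False

# Assumed schema for a customer record.
CUSTOMER_SCHEMA = [
    Column("id", int),
    Column("name", str),
    Column("email", str, nullable=True),
]

def validate(row, schema):
    """Return the problems found in `row`; an empty list means it fits the schema."""
    problems = []
    for column in schema:
        if column.name not in row:
            problems.append(f"missing column: {column.name}")
            continue
        value = row[column.name]
        if value is None:
            if not column.nullable:
                problems.append(f"null not allowed: {column.name}")
        elif not isinstance(value, column.py_type):
            problems.append(
                f"wrong type for {column.name}: expected {column.py_type.__name__}"
            )
    # Columns that the schema does not know about are reported as well.
    known = {column.name for column in schema}
    for name in sorted(set(row) - known):
        problems.append(f"unexpected column: {name}")
    return problems

good = validate({"id": 1, "name": "Alice", "email": None}, CUSTOMER_SCHEMA)
bad = validate({"id": "1", "name": None, "phone": "555"}, CUSTOMER_SCHEMA)
print(good)
print(bad)
```

With the schema declared up front, malformed rows are rejected at the boundary instead of causing failures deep inside a transformation.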
Tutorial, Talend jobs basics
● Place components on the designer
● Link components to build a transformation
● Main type of link: row flow
● Schema metadata is propagated and must be coherent
● ..demo..
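The idea of components linked by a row flow, with schemas that must agree across each link, can be sketched in Python as follows. The `Component` class, `run_job` and the two example components are hypothetical; they only imitate the concept and say nothing about how Talend implements it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

@dataclass
class Component:
    """A processing step with a declared input and output schema."""
    name: str
    input_schema: Tuple[str, ...]
    output_schema: Tuple[str, ...]
    process: Callable[[Iterable[dict]], Iterator[dict]]

def keep_positive(flow):
    """Filter component: pass on only rows with a positive amount."""
    return (row for row in flow if row["amount"] > 0)

def add_double(flow):
    """Map component: add a column computed from an existing one."""
    return ({**row, "double": row["amount"] * 2} for row in flow)

def run_job(components, rows):
    """Check that linked schemas are coherent, then stream rows through the chain."""
    for upstream, downstream in zip(components, components[1:]):
        if upstream.output_schema != downstream.input_schema:
            raise ValueError(
                f"schema mismatch between {upstream.name} and {downstream.name}"
            )
    flow = iter(rows)
    for component in components:
        flow = component.process(flow)
    return list(flow)

job = [
    Component("filter", ("id", "amount"), ("id", "amount"), keep_positive),
    Component("map", ("id", "amount"), ("id", "amount", "double"), add_double),
]
result = run_job(job, [{"id": 1, "amount": 5}, {"id": 2, "amount": -1}])
print(result)
```

Linking the same two components in the opposite order raises a `ValueError`, because the output schema of `map` no longer matches the input schema of `filter`.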
Extensibility, community plugins
● Many official components
● Community-released components for a wide range of tasks
● Geospatial components, log analysis, Google Analytics, data encryption, etc.
And now.. reports, dashboards, OLAP, Geoanalysis, KPIs..
Do you trust your data?
What about data quality?
● Customer A is present 5 times with different names
● Null values can distort statistical indexes such as the mean
● Duplicated records
● Blank values
● Some records can contain errors (e.g. field values of -1)
● Some records can be garbage
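The quality problems listed above can be counted with a small profiling pass. The sketch below uses made-up records and hypothetical helper names (`normalize`, `profile`); treating -1 as an error marker is an assumption taken from the example on the slide.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up sample records showing each kind of problem.
records = [
    {"customer": "ACME Corp", "amount": 100},
    {"customer": "Acme Corp.", "amount": 100},
    {"customer": "acme corp", "amount": None},
    {"customer": "", "amount": 250},
    {"customer": "Globex", "amount": -1},
    {"customer": "Globex", "amount": -1},
]

def normalize(name):
    """Lower-case and drop punctuation so spelling variants compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()

def profile(rows):
    """Count nulls, blanks, error markers, duplicates and name variants."""
    spellings = defaultdict(set)
    for row in rows:
        if row["customer"].strip():
            spellings[normalize(row["customer"])].add(row["customer"])
    exact = Counter((row["customer"], row["amount"]) for row in rows)
    return {
        "null_amounts": sum(1 for row in rows if row["amount"] is None),
        "blank_customers": sum(1 for row in rows if not row["customer"].strip()),
        "error_markers": sum(1 for row in rows if row["amount"] == -1),
        "exact_duplicates": sum(n - 1 for n in exact.values() if n > 1),
        "name_variants": {k: len(v) for k, v in spellings.items() if len(v) > 1},
    }

report = profile(records)

# The mean changes depending on how nulls and error markers are handled.
naive_mean = mean(row["amount"] or 0 for row in records)
clean_mean = mean(
    row["amount"] for row in records
    if row["amount"] is not None and row["amount"] != -1
)
print(report)
print(naive_mean, clean_mean)
```

On this sample the naive mean (nulls counted as 0, error markers kept) is roughly half of the mean computed on usable values only, which is the distortion the slide warns about.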
What about data storage size?
● Some fields can be oversized for the data they contain
● Sometimes fields are related and can be calculated
● Some keys or values are never used
● When data grows, garbage grows
● Data storage is not free (disks, electricity, backups, DB licenses)
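Two of these checks, oversized fields and fields that can be calculated from others, are easy to sketch in Python. The declared widths, the sample rows and the helper name `column_usage` are assumptions for the illustration only.

```python
# Hypothetical declared column widths (e.g. VARCHAR sizes) and sample rows.
declared_widths = {"code": 50, "city": 255}
rows = [
    {"code": "A1", "city": "Rome", "quantity": 2, "price": 5, "total": 10},
    {"code": "B22", "city": "Trento", "quantity": 1, "price": 7, "total": 7},
    {"code": "C3", "city": None, "quantity": 3, "price": 4, "total": 12},
]

def column_usage(rows, declared_widths):
    """Compare each declared width with the longest value actually stored."""
    usage = {}
    for column, declared in declared_widths.items():
        longest = max(
            (len(row[column]) for row in rows if row[column] is not None),
            default=0,
        )
        usage[column] = {"declared": declared, "longest": longest}
    return usage

usage = column_usage(rows, declared_widths)

# A stored field is redundant if it can always be recomputed from other fields.
total_is_derivable = all(
    row["total"] == row["quantity"] * row["price"] for row in rows
)
print(usage)
print(total_is_derivable)
```

A large gap between `declared` and `longest`, or a field that is always derivable, is a hint worth reviewing rather than proof: a sample may simply not contain the longest legitimate values.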
Data is "the black gold" that can produce knowledge
● Data is a resource from which you can extract knowledge
● A lot of data can be distilled into concise information
● Data storage is not free, and a lot of data can slow systems down
● Data cleansing is a central process in statistical analysis and Data Mining
Talend Master Data Management