Successful AI/ML Projects with End-to-End Cloud Data Engineering


Trusted, high-quality data and efficient use of data engineers’ time are critical success factors for AI/ML projects. Enterprise data is complex: it comes from many sources, in a variety of formats, and at varying speeds. Machine learning projects on Apache Spark need a holistic approach to data engineering: finding and discovering data, ingesting and integrating it, serverless processing at scale, and data governance. Stop by this session for an overview of how to set up AI/ML projects for success while Informatica takes the heavy lifting out of your data engineering.

  1. Successful AI/ML Projects with End-to-End Cloud Data Engineering. Louis Polycarpou, Technical Director, Cloud, Data Engineering, and Data Integration
  2. © Informatica. Proprietary and Confidential. AI/ML Projects in the Enterprise Today: only 1% of AI/ML projects are successful. (Source: Databricks research, 2018)
  3. Why are AI/ML projects so difficult?
     • Data scientists spend 80% of their time preparing data and only 20% on modeling
     • Data challenges: data arrives at high volume and high velocity from a variety of sources
     • Enterprise data cannot be provisioned if it lacks governance or is hidden
     • Productivity is lost in repetitive data pipelines that move and prepare data
     • Data engineers spend too much time on capacity planning for big data processing
     End-to-end data engineering holds the key!
  4. End-to-End Data Engineering is Key to ML Projects. Dimensions: any data, any regulation, any user, any cloud / any technology, any latency. Capabilities: metadata, governance, ingest, stream, integrate, cleanse, prepare, define, catalog, relate, protect, deliver, enrich. Hybrid, modern data integration patterns.
  5. Informatica Data Engineering Integration. Informatica + Databricks: accelerate data engineering pipelines for AI and analytics. Components: Informatica Cloud Data Integration, Informatica Enterprise Data Catalog, reliable data lakes at scale, data discovery, audit, and lineage, data pipeline development, data ingestion from hybrid sources.
  6. Informatica Enterprise Data Catalog
     • Comprehensive discovery of data assets for accurate machine learning models
     • Easily find and discover trusted data for building machine learning models
     • Explore holistic data relationships
     • End-to-end data lineage through the analytics process
     • Integrated business glossary
     • Crowd-sourced curation of data assets
     • Machine-learning-based semantic inference and recommendations
  7. Informatica Data Engineering Portfolio: the industry’s most comprehensive data engineering solution for multi-cloud and hybrid environments, in Spark “true” serverless mode.
     • Data Engineering Integration (DEI): intelligently manage data pipelines for faster insights; data ingestion and processing
     • Data Engineering Streaming (DES): turn volumes of streaming and IoT data into trusted insights
     • Data Engineering Quality (DEQ): govern all your data on Spark in cloud and other environments to ensure it is trusted and relevant
     • Data Engineering Masking (DEM): de-identify, de-sensitize, and anonymize sensitive data from unauthorized access for app users, BI, and AI and analytics
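The de-identification idea behind masking tools such as DEM can be sketched with deterministic pseudonymization: the same input always maps to the same token, so joins and aggregates still work, but the raw value is not recoverable without the secret key. This is a minimal illustration, not Informatica’s implementation; the field names and key are invented.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-outside-source-control"  # placeholder key, never hard-code in practice

def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a sensitive value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Replace sensitive fields with tokens; leave the rest untouched."""
    return {
        key: pseudonymize(value) if key in sensitive_fields else value
        for key, value in record.items()
    }

row = {"customer_id": "C123", "email": "a@example.com", "country": "UK"}
masked = mask_record(row, {"email"})
```

Because the mapping is deterministic, two masked datasets can still be joined on the tokenized column, which is what makes this approach usable for BI and analytics on de-identified data.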
  8. 8. No Code, No Ops, No Limits On Data
  9. No Code: Leverage the Power of an Easy-to-Use Interface. Future-proof your investments: design once and run on a best-of-breed engine. The slide contrasts three forms of the same TPC-H Query 3 workload: a SQL query, hand-written Spark code, and a DEI mapping. (Note that, as shown, the SQL and Scala versions use different segment and date literals.)

SQL query:

```sql
select l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate,
       o_shippriority
from CUSTOMER, ORDERS, LINEITEM
where c_mktsegment = 'AUTOMOBILE'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-13'
  and l_shipdate > date '1995-03-13'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;
```

Spark code:

```scala
package main.scala

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{sum, udf}

/** Query 3 */
class Q03 extends TpchQuery {

  override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = {
    // Used to implicitly convert an RDD to a DataFrame.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import schemaProvider._

    val decrease = udf { (x: Double, y: Double) => x * (1 - y) }

    val fcust = customer.filter($"c_mktsegment" === "BUILDING")
    val forders = order.filter($"o_orderdate" < "1995-03-15")
    val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15")

    fcust.join(forders, $"c_custkey" === forders("o_custkey"))
      .select($"o_orderkey", $"o_orderdate", $"o_shippriority")
      .join(flineitems, $"o_orderkey" === flineitems("l_orderkey"))
      .select($"l_orderkey",
        decrease($"l_extendedprice", $"l_discount").as("volume"),
        $"o_orderdate", $"o_shippriority")
      .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority")
      .agg(sum($"volume").as("revenue"))
      .sort($"revenue".desc, $"o_orderdate")
      .limit(10)
  }
}
```
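To make the shared semantics of the SQL and Spark versions concrete, here is the same filter-join-aggregate logic in plain Python over tiny in-memory tables (all rows are invented for illustration; the real query runs over TPC-H data):

```python
from collections import defaultdict

# Toy stand-ins for CUSTOMER, ORDERS, LINEITEM.
customers = [{"c_custkey": 1, "c_mktsegment": "AUTOMOBILE"},
             {"c_custkey": 2, "c_mktsegment": "BUILDING"}]
orders = [{"o_orderkey": 10, "o_custkey": 1, "o_orderdate": "1995-03-01"},
          {"o_orderkey": 11, "o_custkey": 2, "o_orderdate": "1995-03-01"}]
lineitems = [{"l_orderkey": 10, "l_extendedprice": 100.0, "l_discount": 0.1, "l_shipdate": "1995-04-01"},
             {"l_orderkey": 10, "l_extendedprice": 50.0, "l_discount": 0.0, "l_shipdate": "1995-04-02"}]

# Filter each side, join on keys, then aggregate revenue per order.
seg_keys = {c["c_custkey"] for c in customers if c["c_mktsegment"] == "AUTOMOBILE"}
open_orders = {o["o_orderkey"] for o in orders
               if o["o_custkey"] in seg_keys and o["o_orderdate"] < "1995-03-13"}

revenue = defaultdict(float)
for li in lineitems:
    if li["l_orderkey"] in open_orders and li["l_shipdate"] > "1995-03-13":
        revenue[li["l_orderkey"]] += li["l_extendedprice"] * (1 - li["l_discount"])

# Top 10 orders by revenue, matching the query's ORDER BY ... LIMIT 10.
top = sorted(revenue.items(), key=lambda kv: -kv[1])[:10]
```

The point of a no-code mapping is that this filter-join-aggregate shape is declared once and the platform generates the engine-specific execution (SQL, Spark, or otherwise) for you.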
  10. No Code: Schema Drift Handling. Handle complex structures and their changes for both batch and streaming data.
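One common schema-drift strategy, additive evolution, can be sketched minimally in plain Python: new fields extend the running schema and missing fields are null-filled, so downstream consumers always see a superset schema. Field names here are invented, and real engines (DEI, Spark’s schema merging, etc.) add type reconciliation on top of this idea.

```python
def evolve_schema(schema: dict, record: dict) -> dict:
    """Extend the schema with any fields this record introduces."""
    for field, value in record.items():
        schema.setdefault(field, type(value).__name__)
    return schema

def conform(record: dict, schema: dict) -> dict:
    """Project a record onto the current schema, null-filling gaps."""
    return {field: record.get(field) for field in schema}

schema: dict = {}
batch = [
    {"device_id": "d1", "temp": 21.5},
    {"device_id": "d2", "temp": 19.0, "humidity": 40},  # a new field arrives
    {"device_id": "d3"},                                # a field goes missing
]
for rec in batch:
    schema = evolve_schema(schema, rec)
rows = [conform(rec, schema) for rec in batch]
```

The same two steps apply to streaming: evolve the schema as events arrive, then conform each event before writing, so a late field never breaks the pipeline.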
  11. No Ops: Azure Databricks Support. Leverage the compute power of Databricks on Azure for big data processing.
  12. No Ops: Advanced Spark Support. Take advantage of the latest innovation, performance, and scaling benefits.
  13. No Ops: Operational Insights. Deliver predictive operational insights about your data engineering environments.
  14. No Limits on Data: Ingest Any Data in Real Time and Batch. Mass ingestion of streaming/IoT data, files, and databases.
  15. No Limits on Data: High-Speed Mass Ingestion. Rely on an easy-to-use, fast, and scalable approach with no hand-coding.
  16. No Limits on Data: Spark Structured Streaming Support. Handle streaming data based on event time instead of processing time.
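The event-time-versus-processing-time distinction can be shown with a minimal tumbling-window sketch: events are bucketed by the timestamp they carry, not by when they happen to arrive, so out-of-order data still lands in the correct window. This is the idea behind Spark Structured Streaming’s event-time windows, reduced to plain Python with invented epoch-second timestamps (watermarking, which bounds how late data may be, is omitted).

```python
from collections import defaultdict

WINDOW = 60  # one-minute tumbling windows

def window_start(event_time: int) -> int:
    """Align an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW)

events = [        # (event_time, value), in arrival order
    (100, 1),
    (40, 5),      # arrives late, but belongs to the 0-60 window
    (130, 2),
]
counts = defaultdict(int)
for event_time, value in events:
    counts[window_start(event_time)] += value
```

With processing-time windowing, the late `(40, 5)` event would have been counted in whatever window was open when it arrived; keyed by event time, it is attributed to the window it actually belongs to.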
  17. Cloud-Ready Reference Architecture: Informatica + Azure Databricks. Pipeline stages: acquire, ingest, prepare, catalog, secure, govern, access, consume. Sources: relational, device data, weblogs. Catalog capabilities: search, lineage, recommendations, parse, match. Azure components: Storage blob, ADLS/Blob, Azure Databricks, SQL Data Warehouse.
  18. Takeda Technical Architecture. Data sources feed Informatica Data Engineering Integration (DEI) and IICS, including streaming [PaaS], into staged storage zones (STAGE, LAKE, HUB, MART), processed with Databricks [PaaS] and Hadoop [PaaS], and consumed through data visualization [IaaS/SaaS] and self-service analytics [PaaS] across market-center, commercial, corporate, and GMS analytics.
  19. Critical Success Factors of Your AI/ML Projects:
     1. Find and discover data across all enterprise systems
     2. Accelerate movement of data to Databricks
     3. Prepare and enrich the data before you start modeling
     4. Increase productivity with a no-code UI for data engineering
     5. Go serverless by processing data pipelines on Databricks
  20. Learn More:
     1. Stop by the Informatica booth #90 for a custom demo
     2. Hear more about AI-Powered Streaming Analytics for Real-Time Customer Experience: tomorrow, 11:00 am, Room E102
     3. Visit http://www.informatica.com/databricks
     4. Sign up for hands-on workshops on serverless cloud data lakes
  21. Thank You! Louis Polycarpou, Technical Director, Cloud, Data Engineering, and Data Integration
Uploaded by tushar_kale, Nov. 10, 2019.

