Successful AI/ML Projects with End-to-End Cloud Data Engineering


Trusted, high-quality data and efficient use of data engineers’ time are critical success factors for AI/ML projects. Enterprise data is complex: it comes from many sources, in a variety of formats, and at varying speeds. Machine learning projects on Apache Spark need a holistic approach to data engineering: finding and discovering data, ingesting and integrating it, serverless processing at scale, and data governance. Stop by this session for an overview of how to set up AI/ML projects for success while Informatica takes the heavy lifting out of your data engineering.

  1. Successful AI/ML Projects with End-to-End Cloud Data Engineering. Louis Polycarpou, Technical Director, Cloud, Data Engineering, and Data Integration
  2. © Informatica. Proprietary and Confidential. AI/ML Projects in the Enterprise Today: only 1% of AI/ML projects are successful. (Source: Databricks research, 2018)
  3. Why are AI/ML projects so difficult?
     • Data scientists spend 80% of their time preparing data and only 20% on modeling
     • Data challenges: data arrives at high volume and high velocity from a variety of sources
     • Enterprise data cannot be provisioned if it lacks governance or is hidden
     • Productivity is lost in repetitive data pipelines that move and prepare data
     • Data engineers spend too much time on capacity planning for big data processing
     End-to-end data engineering holds the key!
  4. End-to-End Data Engineering is Key to ML Projects. Dimensions: any data, any regulation, any user, any cloud / any technology, any latency. Capabilities: metadata, governance, ingest, stream, integrate, cleanse, prepare, define, catalog, relate, protect, deliver, enrich. Hybrid, modern data integration patterns.
  5. Informatica Data Engineering Integration. Informatica + Databricks: accelerate data engineering pipelines for AI and analytics. Components: Informatica Cloud Data Integration, Informatica Enterprise Data Catalog, reliable data lakes at scale, data discovery, audit, and lineage, data pipeline development, data ingestion from hybrid sources.
  6. Informatica Enterprise Data Catalog
     • Comprehensive discovery of data assets for accurate machine learning models
     • Easily find and discover trusted data for building machine learning models
     • Explore holistic data relationships
     • End-to-end data lineage through the analytics process
     • Integrated business glossary
     • Crowd-sourced curation of data assets
     • Machine-learning-based semantic inference and recommendations
  7. Informatica Data Engineering Portfolio: the industry’s most comprehensive data engineering solution for multi-cloud and hybrid environments, in Spark “true” serverless mode.
     • Data Engineering Integration (DEI): intelligently manage data pipelines for faster insights; data ingestion and processing
     • Data Engineering Streaming (DES): turn volumes of streaming and IoT data into trusted insights
     • Data Engineering Quality (DEQ): govern all your data on Spark in cloud and other environments to ensure it is trusted and relevant
     • Data Engineering Masking (DEM): de-identify, de-sensitize, and anonymize sensitive data from unauthorized access for app users, BI, and AI and analytics
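The de-identification idea behind masking tools such as DEM can be sketched with deterministic pseudonymization: the same input always maps to the same token, so joins and aggregates still work, but the raw value is not recoverable without the secret key. This is a minimal illustration, not Informatica’s implementation; the field names and key are invented.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-outside-source-control"  # placeholder key, never hard-code in practice

def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a sensitive value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Replace sensitive fields with tokens; leave the rest untouched."""
    return {
        key: pseudonymize(value) if key in sensitive_fields else value
        for key, value in record.items()
    }

row = {"customer_id": "C123", "email": "a@example.com", "country": "UK"}
masked = mask_record(row, {"email"})
```

Because the mapping is deterministic, two masked datasets can still be joined on the tokenized column, which is what makes this approach usable for BI and analytics on de-identified data.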
  8. 8. No Code, No Ops, No Limits On Data
  9. No Code: Leverage the Power of an Easy-to-Use Interface. Future-proof your investments: design once and run on a best-of-breed engine. The slide contrasts three forms of the same TPC-H Query 3 workload: a SQL query, hand-written Spark code, and a DEI mapping. (Note that, as shown, the SQL and Scala versions use different segment and date literals.)

SQL query:

```sql
select l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       o_orderdate,
       o_shippriority
from CUSTOMER, ORDERS, LINEITEM
where c_mktsegment = 'AUTOMOBILE'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-13'
  and l_shipdate > date '1995-03-13'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10;
```

Spark code:

```scala
package main.scala

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{sum, udf}

/** Query 3 */
class Q03 extends TpchQuery {

  override def execute(sc: SparkContext, schemaProvider: TpchSchemaProvider): DataFrame = {
    // Used to implicitly convert an RDD to a DataFrame.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import schemaProvider._

    val decrease = udf { (x: Double, y: Double) => x * (1 - y) }

    val fcust = customer.filter($"c_mktsegment" === "BUILDING")
    val forders = order.filter($"o_orderdate" < "1995-03-15")
    val flineitems = lineitem.filter($"l_shipdate" > "1995-03-15")

    fcust.join(forders, $"c_custkey" === forders("o_custkey"))
      .select($"o_orderkey", $"o_orderdate", $"o_shippriority")
      .join(flineitems, $"o_orderkey" === flineitems("l_orderkey"))
      .select($"l_orderkey",
        decrease($"l_extendedprice", $"l_discount").as("volume"),
        $"o_orderdate", $"o_shippriority")
      .groupBy($"l_orderkey", $"o_orderdate", $"o_shippriority")
      .agg(sum($"volume").as("revenue"))
      .sort($"revenue".desc, $"o_orderdate")
      .limit(10)
  }
}
```
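To make the shared semantics of the SQL and Spark versions concrete, here is the same filter-join-aggregate logic in plain Python over tiny in-memory tables (all rows are invented for illustration; the real query runs over TPC-H data):

```python
from collections import defaultdict

# Toy stand-ins for CUSTOMER, ORDERS, LINEITEM.
customers = [{"c_custkey": 1, "c_mktsegment": "AUTOMOBILE"},
             {"c_custkey": 2, "c_mktsegment": "BUILDING"}]
orders = [{"o_orderkey": 10, "o_custkey": 1, "o_orderdate": "1995-03-01"},
          {"o_orderkey": 11, "o_custkey": 2, "o_orderdate": "1995-03-01"}]
lineitems = [{"l_orderkey": 10, "l_extendedprice": 100.0, "l_discount": 0.1, "l_shipdate": "1995-04-01"},
             {"l_orderkey": 10, "l_extendedprice": 50.0, "l_discount": 0.0, "l_shipdate": "1995-04-02"}]

# Filter each side, join on keys, then aggregate revenue per order.
seg_keys = {c["c_custkey"] for c in customers if c["c_mktsegment"] == "AUTOMOBILE"}
open_orders = {o["o_orderkey"] for o in orders
               if o["o_custkey"] in seg_keys and o["o_orderdate"] < "1995-03-13"}

revenue = defaultdict(float)
for li in lineitems:
    if li["l_orderkey"] in open_orders and li["l_shipdate"] > "1995-03-13":
        revenue[li["l_orderkey"]] += li["l_extendedprice"] * (1 - li["l_discount"])

# Top 10 orders by revenue, matching the query's ORDER BY ... LIMIT 10.
top = sorted(revenue.items(), key=lambda kv: -kv[1])[:10]
```

The point of a no-code mapping is that this filter-join-aggregate shape is declared once and the platform generates the engine-specific execution (SQL, Spark, or otherwise) for you.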
  10. No Code: Schema Drift Handling. Handle complex structures and their changes for both batch and streaming data.
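One common schema-drift strategy, additive evolution, can be sketched minimally in plain Python: new fields extend the running schema and missing fields are null-filled, so downstream consumers always see a superset schema. Field names here are invented, and real engines (DEI, Spark’s schema merging, etc.) add type reconciliation on top of this idea.

```python
def evolve_schema(schema: dict, record: dict) -> dict:
    """Extend the schema with any fields this record introduces."""
    for field, value in record.items():
        schema.setdefault(field, type(value).__name__)
    return schema

def conform(record: dict, schema: dict) -> dict:
    """Project a record onto the current schema, null-filling gaps."""
    return {field: record.get(field) for field in schema}

schema: dict = {}
batch = [
    {"device_id": "d1", "temp": 21.5},
    {"device_id": "d2", "temp": 19.0, "humidity": 40},  # a new field arrives
    {"device_id": "d3"},                                # a field goes missing
]
for rec in batch:
    schema = evolve_schema(schema, rec)
rows = [conform(rec, schema) for rec in batch]
```

The same two steps apply to streaming: evolve the schema as events arrive, then conform each event before writing, so a late field never breaks the pipeline.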
  11. No Ops: Azure Databricks Support. Leverage the compute power of Databricks on Azure for big data processing.
  12. No Ops: Advanced Spark Support. Take advantage of the latest innovation, performance, and scaling benefits.
  13. No Ops: Operational Insights. Deliver predictive operational insights about your data engineering environments.
  14. No Limits on Data: Ingest Any Data in Real Time and Batch. Mass ingestion of streaming/IoT data, files, and databases.
  15. No Limits on Data: High-Speed Mass Ingestion. Rely on an easy-to-use, fast, and scalable approach with no hand-coding.
  16. No Limits on Data: Spark Structured Streaming Support. Handle streaming data based on event time instead of processing time.
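The event-time-versus-processing-time distinction can be shown with a minimal tumbling-window sketch: events are bucketed by the timestamp they carry, not by when they happen to arrive, so out-of-order data still lands in the correct window. This is the idea behind Spark Structured Streaming’s event-time windows, reduced to plain Python with invented epoch-second timestamps (watermarking, which bounds how late data may be, is omitted).

```python
from collections import defaultdict

WINDOW = 60  # one-minute tumbling windows

def window_start(event_time: int) -> int:
    """Align an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW)

events = [        # (event_time, value), in arrival order
    (100, 1),
    (40, 5),      # arrives late, but belongs to the 0-60 window
    (130, 2),
]
counts = defaultdict(int)
for event_time, value in events:
    counts[window_start(event_time)] += value
```

With processing-time windowing, the late `(40, 5)` event would have been counted in whatever window was open when it arrived; keyed by event time, it is attributed to the window it actually belongs to.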
  17. Cloud-Ready Reference Architecture: Informatica + Azure Databricks. Pipeline stages: acquire, ingest, prepare, catalog, secure, govern, access, consume. Sources: relational, device data, weblogs. Catalog capabilities: search, lineage, recommendations, parse, match. Azure components: Storage blob, ADLS/Blob, Azure Databricks, SQL Data Warehouse.
  18. Takeda Technical Architecture. Data sources feed Informatica Data Engineering Integration (DEI) and IICS, including streaming [PaaS], into staged storage zones (STAGE, LAKE, HUB, MART), processed with Databricks [PaaS] and Hadoop [PaaS], and consumed through data visualization [IaaS/SaaS] and self-service analytics [PaaS] across market-center, commercial, corporate, and GMS analytics.
  19. Critical Success Factors of Your AI/ML Projects:
     1. Find and discover data across all enterprise systems
     2. Accelerate movement of data to Databricks
     3. Prepare and enrich the data before you start modeling
     4. Increase productivity with a no-code UI for data engineering
     5. Go serverless by processing data pipelines on Databricks
  20. Learn More:
     1. Stop by the Informatica booth #90 for a custom demo
     2. Hear more about AI-Powered Streaming Analytics for Real-Time Customer Experience: tomorrow, 11:00 am, Room E102
     3. Visit http://www.informatica.com/databricks
     4. Sign up for hands-on workshops on serverless cloud data lakes
  21. Thank You! Louis Polycarpou, Technical Director, Cloud, Data Engineering, and Data Integration
Uploaded by tushar_kale, Nov. 10, 2019.

