
Engineering Machine Learning Data Pipeline Series: Pulling in Data from Multiple Sources

Clearing the hurdle from designing a machine learning model to putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering a robust production data pipeline has its own set of tough problems to solve. Syncsort’s solutions help the data engineer every step of the way.

The first step is consolidating data from sources all over the enterprise. The data that feeds machine learning models comes from a wide variety of physical locations, technical platforms, and storage formats. The first challenge is getting parallel onboarding capability and connectivity to sources from mainframe to streaming to Cloud, and moving all that data onto the cluster. The next challenge is transforming all the data from its source storage format to the target, whether that is Hive, Impala, HDFS, ORC, Parquet, Kudu, or something else entirely. The final challenge is getting the data normalized, aggregated – or otherwise changed – and the features filtered down.
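To make those three challenges concrete, here is a minimal hand-coded sketch, in the Spark/Scala style the webcast alludes to, of onboarding a single relational table over JDBC, rewriting it as Parquet in Hive, and aggregating it down to features. The connection string, table, and column names are hypothetical placeholders, not details from the webcast.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object OnboardTransactions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("onboard-transactions")
          .enableHiveSupport()
          .getOrCreate()

        // Challenge 1: pull the source table onto the cluster (hypothetical Db2 source).
        val raw = spark.read
          .format("jdbc")
          .option("url", "jdbc:db2://db2host:50000/SALES")   // placeholder connection
          .option("dbtable", "SALES.TRANSACTIONS")           // placeholder table
          .option("user", sys.env.getOrElse("DB_USER", ""))
          .option("password", sys.env.getOrElse("DB_PASS", ""))
          .load()

        // Challenge 2: convert from the source storage format to the target format,
        // here Parquet files registered as a Hive table.
        // Challenge 3: normalize, aggregate, and filter down to the features the model needs.
        val features = raw
          .withColumn("txn_date", to_date(col("TXN_TS")))    // normalize timestamp to a date
          .groupBy(col("CUSTOMER_ID"), col("txn_date"))
          .agg(sum("AMOUNT").as("daily_spend"),
               count("*").as("txn_count"))
          .filter(col("daily_spend") > 0)

        features.write
          .mode("overwrite")
          .format("parquet")
          .saveAsTable("analytics.customer_daily_features")  // placeholder Hive target

        spark.stop()
      }
    }

Even a simple job like this still needs scheduling, error handling, and per-source variations, which is where the weeks of effort described below come from.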

This is only the first part of creating robust production data pipelines, and if you’re not careful it can take weeks or even months of Sqoop scripts, shell scripting, and Scala or Java code to complete this first step alone. Syncsort has helped data engineers solve this problem for years.

View this 15-minute webcast on-demand for a deeper look at a better way to get high-performance data access and integration on your production cluster – without spending a bunch of time coding or tuning. These 15 minutes could save you weeks!


Slide transcript:

  1. Engineering Machine Learning Data Pipelines: Data from Multiple Sources – Paige Roberts, Integrate Product Marketing Manager
  2. Common Machine Learning Applications
     • Anti-money laundering
     • Fraud detection
     • Cybersecurity
     • Targeted marketing
     • Recommendation engine
     • Next best action
     • Customer churn prevention
     • Know your customer
  3. “For want of a nail, the kingdom was lost. For want of a data cleansing and integration tool, the whole AI superstructure can fall down.” – James Kobielus, SiliconANGLE Wikibon Lead Analyst for Data Science, Deep Learning, App Development, 2018
  4. Data Engineer to the Rescue
     Data Scientist:
     • Expert in statistical analysis, machine learning techniques, and finding answers to business questions buried in datasets.
     • Does NOT want to spend 50–90% of their time tinkering with data to get it into good shape to train models – but frequently does, especially if there is no data engineer on the team.
     • Job is complete when the machine learning model is trained, tested, and proven to accomplish the goal; not skilled at taking the model from a test sandbox into production.
     Data Engineer:
     • Expert in data structures, data manipulation, and constructing production data pipelines.
     • WANTS to spend all of their time working with data.
     • First gathers, cleanses, and standardizes the data, helps the data scientist with feature engineering, and delivers top-notch data ready to train models.
     • After the model is tested, builds robust, high-scale data pipelines that feed the models the data they need, in the correct format, in production, to provide ongoing business value.
  5. Five Big Challenges of Engineering ML Data Pipelines – 1. Scattered and Difficult-to-Access Datasets
     Much of the necessary data is trapped in mainframes or streams in from POS systems, web clicks, etc., all in incompatible formats, making it difficult to gather and prepare the data for model training.
  6. Onboard Relational Data Quickly – DMX DataFunnel™
     • Db2, Oracle, Teradata, Netezza, S3, Redshift, …
     • Onboard hundreds of tables into your cluster
     • Onboard whole database schemas at once
     • Create target tables automatically in Hive or Impala
     • Filter unwanted tables, rows, data types, or columns with a mouse click
     • Transform data in flight
     (A hand-coded equivalent of schema-level onboarding is sketched after the transcript.)
  7–9. Onboard ALL Enterprise Data – Mainframe to Streaming Data Sources (built up across three slides)
     • Access data from streaming and batch sources outside the cluster.
     • Onboard data, modifying it on the fly to match the Hadoop storage model, or store it unchanged for archive and compliance.
     • Transform, join, cleanse, and enhance data in the cluster with MapReduce, EMR, or Spark, landing it in the data lake.
  10. Design Once, Deploy Anywhere – Intelligent Execution
     Insulate your organization from the underlying complexities of Hadoop and get excellent performance every time without tuning, load balancing, etc. No re-design, no re-compile, no re-work, ever:
     • Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x
     • Move from dev to test to production
     • Move from on-premise to Cloud
     • Move from one Cloud to another
     Use existing ETL skills – no parallel programming (Java, MapReduce, Spark, …) and no worries about mappers, reducers, or the big side vs. small side of joins (see the join sketch after the transcript).
     Design once in a visual GUI; deploy anywhere: on-premise or Cloud; MapReduce, Spark, or future platforms; Windows, Unix, or Linux; batch or streaming; single node or cluster.
  11. Same Solution – On-Premise or in the Cloud
     Big Data + Cloud + Syncsort = Powerful, Flexible, Cost-Effective
     • ETL engine on AWS Marketplace
     • Available on EC2, EMR, and Google Cloud
     • S3 and Redshift connectivity
     • Google GCS and Amazon S3 support
     • First & only leading ETL engine on Docker Hub
     • Partner with all major public Cloud providers
  12. Bring ALL Enterprise Data Securely to the Data Lake – Build Your Enterprise Data Hub
     • Collect virtually any data, from mainframe to Cloud, relational to NoSQL
     • Batch & streaming sources – Kafka, MapR Streams (a hand-coded streaming sketch follows the transcript)
     • Access, re-format, and load data directly into Hive & Impala – no staging required!
     • Pull hundreds of tables at once into your data hub, whole DB schemas at the push of a button
     • Load more data into Hadoop in less time
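Slide 6 describes pulling whole database schemas onto the cluster at the push of a button. For contrast, a hand-coded equivalent has to enumerate the schema’s tables and loop over them itself. The following Scala/Spark sketch is only illustrative: the Oracle connection string, schema name, and excluded tables are hypothetical, and the matching JDBC driver is assumed to be on the classpath.

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession
    import scala.collection.mutable.ListBuffer

    object OnboardSchema {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("onboard-schema")
          .enableHiveSupport()
          .getOrCreate()

        val url    = "jdbc:oracle:thin:@//orahost:1521/ORCL"  // placeholder source
        val schema = "SALES"                                  // placeholder schema
        val user   = sys.env.getOrElse("DB_USER", "")
        val pass   = sys.env.getOrElse("DB_PASS", "")

        // Enumerate the schema's tables through JDBC metadata.
        val conn   = DriverManager.getConnection(url, user, pass)
        val rs     = conn.getMetaData.getTables(null, schema, "%", Array("TABLE"))
        val tables = ListBuffer[String]()
        while (rs.next()) tables += rs.getString("TABLE_NAME")
        conn.close()

        // Copy each table into Hive, skipping the ones we do not want.
        val skip = Set("AUDIT_LOG", "TMP_LOAD")               // hypothetical exclusions
        tables.filterNot(skip.contains).foreach { t =>
          spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", s"$schema.$t")
            .option("user", user)
            .option("password", pass)
            .load()
            .write.mode("overwrite")
            .saveAsTable(s"staging.${t.toLowerCase}")
        }

        spark.stop()
      }
    }

Everything the slide says DataFunnel handles with a mouse click – column filtering, type mapping, in-flight transformation – would be additional code on top of this loop.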
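Slide 10 lists the “big side or small side of joins” among the details that Intelligent Execution hides. In hand-written Spark that decision surfaces as an explicit broadcast hint the developer has to get right; a minimal sketch with hypothetical table names:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object JoinSideExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("join-side")
          .enableHiveSupport()
          .getOrCreate()

        val transactions = spark.table("staging.transactions")  // large fact table (hypothetical)
        val branches     = spark.table("staging.branches")      // small dimension table (hypothetical)

        // The developer must decide which side is small enough to broadcast;
        // guessing wrong means unnecessary shuffles or executor out-of-memory errors.
        val joined = transactions.join(broadcast(branches), Seq("branch_id"))

        joined.write.mode("overwrite").saveAsTable("analytics.transactions_enriched")
        spark.stop()
      }
    }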
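Slide 12 counts Kafka among the streaming sources landed directly in the data lake with no staging. A hedged Spark Structured Streaming sketch of the hand-coded alternative follows; the broker address, topic, and lake paths are placeholders, and the spark-sql-kafka connector is assumed to be available.

    import org.apache.spark.sql.SparkSession

    object StreamToLake {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("stream-to-lake")
          .getOrCreate()

        // Read the raw event stream from Kafka (placeholder broker and topic).
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load()
          .selectExpr("CAST(key AS STRING) AS event_key",
                      "CAST(value AS STRING) AS event_body",
                      "timestamp")

        // Land the stream in the data lake as Parquet, with a checkpoint for recovery.
        val query = events.writeStream
          .format("parquet")
          .option("path", "/data/lake/clickstream")            // placeholder lake path
          .option("checkpointLocation", "/data/checkpoints/clickstream")
          .start()

        query.awaitTermination()
      }
    }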
