Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The road to AI is paved with pragmatic intentions

106 views

Published on

Those slides were used for NC Tech's lunch and learn on Aug. 22 2018.

This lunch and learn, hosted by Veracity Solutions, you will learn how Spark can help your business build a pragmatic technology roadmap to AI (Artificial Intelligence), Machine Learning, and Big Data analytics. Apache Spark is a wonderful platform for distributed data processing and analytics, but how is it used by different organizations? How difficult is it to on-board a team, what technology do they need to master before on-boarding, do they have to master Scala or simply use their Java skills? You will find answers to those questions, get a realistic perspective on the platform, and see code (because we are all a bit geeks, right?)

Full link to the event: https://www.nctech.org/events/event/2018/lunch-and-learn-august22.html.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

The road to AI is paved with pragmatic intentions

  1. 1. CONFIDENTIAL © 2018 The road to AI is paved with pragmatic intentions Jean Georges “JG" Perrin August 22nd 2018
  2. 2. CONFIDENTIAL © 2018 JGP • Jean Georges Perrin • @jgperrin • Chapel Hill, NC • I 🏗 SW • Since 1983 • #Knowledge = 
 𝑓 ( ∑ (#SmallData, #BigData), #DataScience)
 & #Software  • #IBMChampion x10 • #KeepLearning • @ http://jgp.net
  3. 3. CONFIDENTIAL © 2018
  4. 4. CONFIDENTIAL © 2018 Who are thou? • Experience with Spark? • Who is familiar with Data Quality? • Who has already implemented Data Quality in Spark? • Who is expecting to be an AI guru after this session? • Who just came for the free food? • Who is expecting to provide better insights faster after this session?
  5. 5. CONFIDENTIAL © 2018 • What is ? • What can I do with ? • What is a app, anyway? • What’s AI? • Why is a great environment for AI? • Meet Cactar, the Mongolian warlord of data quality • Why data quality matters? • Your first AI app with • And finally a little surprise… Agenda
  6. 6. CONFIDENTIAL © 2018 Analytics operating system
  7. 7. CONFIDENTIAL © 2018 Apps Analytics Distrib. An analytics operating system? Hardware OS Apps HardwareHardware OS OS Distributed OS Analytics OS Apps HardwareHardware OS OS Apps
  8. 8. CONFIDENTIAL © 2018 An analytics operating system? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  9. 9. CONFIDENTIAL © 2018 An analytics operating system? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  10. 10. CONFIDENTIAL © 2018 An analytics operating system? HardwareHardware OS OS Distributed OS Analytics OS Apps {
  11. 11. CONFIDENTIAL © 2018 use cases • NCEatery.com • Restaurant analytics • 1.57×1021 datapoints analyzed (that’s about one zetta datapoints) • (@ Lumeris) • General compute • Distributed data transfer • IBM • DSX (Data Science Experience) • Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/ • CERN • Analysis of the science experiments in the LHC - Large Hadron Collider
  12. 12. CONFIDENTIAL © 2018 Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph) Apache Spark
  13. 13. CONFIDENTIAL © 2018 Node 1 - OS Node 2 - OS Node 3 - OS Node 4 - OS Node 1 - Hardware Node 2 - Hardware Node 3 - Hardware Node 4 - Hardware Unified API Spark SQL Spark Streaming Machine Learning (& Deep Learning) GraphX Node 5 - OS Node 5 - Hardware Your Application … …
  14. 14. CONFIDENTIAL © 2018 Node 1 Node 2 Node 3 Node 4 Unified API Spark SQL Spark Streaming Machine Learning (& Deep Learning) GraphX Node 5 Your Application … DataFrame
  15. 15. CONFIDENTIAL © 2018 Spark SQL Spark Streaming Machine Learning (& Deep Learning) GraphX DataFrame
  16. 16. CONFIDENTIAL © 2018 http://bit.ly/spark-clego
  17. 17. CONFIDENTIAL © 2018 What’s #AI?
  18. 18. CONFIDENTIAL © 2018 Popular beliefs • Robot with human-like behavior • HAL from 2001 • Isaac Asimov • Potential ethic problems • Lots of mathematics • Heavy calculations • Algorithms • Self-driving cars Current state-of-the-art General AI Narrow AI
  19. 19. CONFIDENTIAL © 2018 In 2018… I am an expert in general AI ARTIFICIAL INTELLIGENCE is Machine Learning
  20. 20. CONFIDENTIAL © 2018 Machine learning • Common algorithms • Linear and logistic regressions • Classification and regression trees • K-nearest neighbors (KNN) • Deep learning • Subset of ML • Artificial neural networks (ANNs) • Super CPU intensive, use of GPU
  21. 21. CONFIDENTIAL © 2018 There are two kinds of data scientists: 1) Those who can extrapolate from incomplete
 data. -The Internet
 and my personal dedication to Sam Christie
  22. 22. CONFIDENTIAL © 2018 DATA Engineer DATA Scientist Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps. Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations. Match architecture with business needs. Develop processes for data modeling, mining, and pipelines. Improve data reliability and quality. Prepare data for predictive models. Explore data to find hidden gems and patterns. Tells stories to key stakeholders.
  23. 23. CONFIDENTIAL © 2018 Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer DATA Engineer DATA Scientist SQL
  24. 24. CONFIDENTIAL © 2018 1 3 5 7 9 11 13 15 17 Jan-14 Jul-14 Jan-15 Jul-15 Jan-16 Jul-16 Jan-17 Jul-17 Jan-18 Scala Java Python R Programming languages RedMonk programming language rankings 40 50 60 70 80 90 100 2014 2015 2016 2017 2018 Scala Java Python R SQL IEEE Spectrum, top programming languages
  25. 25. CONFIDENTIAL © 2018 xkcd As goes the old adage: Garbage in, Garbage out
  26. 26. CONFIDENTIAL © 2018 If Everything Was As Simple… Dinner revenue per number of guests
  27. 27. CONFIDENTIAL © 2018 …as a Visual Representation Anomaly #1 Anomaly #2
  28. 28. CONFIDENTIAL © 2018 I Love It When a Plan Comes Together
  29. 29. CONFIDENTIAL © 2018 Data is like a 
 box of chocolates, 
 you never know what you're gonna get. Jean ”Gump” Perrin June 2017
  30. 30. CONFIDENTIAL © 2018 Data from everywhere Databases RDBMS, NoSQL Files CSV, XML, Json, Excel, Photos, Video Machines Services, REST, Streaming, IoT…
  31. 31. CONFIDENTIAL © 2018 CACTAR is not a 
 Mongolian warlord
  32. 32. CONFIDENTIAL © 2018 What is data quality? Attributes (CACTAR): • Consistency, • Accuracy, • Completeness, • Timeliness, • Accessibility, • Reliability. To allow: • Operations, • Decision making, • Planning, • Machine Learning, • Artificial Intelligence.
  33. 33. CONFIDENTIAL © 2018 Ouch, bad data story Hubble was blind because of a 2.2nm error Legend says it’s a metric to imperial/ US conversion
  34. 34. CONFIDENTIAL © 2018 Now it hurts • Challenger exploded in 1986 
 because of a defective O-ring • Root causes were: • Invalid data for the 
 O-ring parts • Lack of data-lineage
 & reporting
 

  35. 35. CONFIDENTIAL © 2018 #1 Scripts and the likes • Source data are cleaned by scripts (shell, Python, Java app…) • I/O intensive • Storage space intensive • No parallelization
  36. 36. CONFIDENTIAL © 2018 #2 Use Spark SQL • All in memory! • But limited to SQL and built- in function
  37. 37. CONFIDENTIAL © 2018 #3 Use UDFs • User Defined Function • SQL can be extended with UDFs • UDFs benefit from the cluster architecture and distributed processing
  38. 38. CONFIDENTIAL © 2018 Enough! Let me code!
  39. 39. CONFIDENTIAL © 2018 SPRU your UDF • Service: 
 build your code, it might be already existing! • Plumbing: 
 connect your existing business logic to Spark via an UDF • Register: 
 the UDF in Spark • Use: 
 the UDF is available in Spark SQL and via callUDF()
  40. 40. CONFIDENTIAL © 2018 Code sample #1.2 - plumbing package net.jgp.labs.sparkdq4ml.dq.udf; 
 import org.apache.spark.sql.api.java.UDF1; import net.jgp.labs.sparkdq4ml.dq.service.*; 
 public class MinimumPriceDataQualityUdf implements UDF1< Double, Double > { public Double call(Double price) throws Exception { return MinimumPriceDataQualityService.checkMinimumPrice(price); } } /jgperrin/net.jgp.labs.sparkdq4ml If price is ok, returns price, if price is ko, returns -1
  41. 41. CONFIDENTIAL © 2018 Code sample #1.3 - register SparkSession spark = SparkSession .builder().appName("DQ4ML").master("local").getOrCreate(); spark.udf().register( "minimumPriceRule", new MinimumPriceDataQualityUdf(), DataTypes.DoubleType); /jgperrin/net.jgp.labs.sparkdq4ml
  42. 42. CONFIDENTIAL © 2018 Code sample #1.4 - use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); Using CSV, but could be Hive, JDBC, name it… /jgperrin/net.jgp.labs.sparkdq4ml
  43. 43. CONFIDENTIAL © 2018 Dataset with anomalies +-----+-----+ |guest|price| +-----+-----+ |   1|23.24| |    2|30.89| |    2|33.74| |    3|34.89| |    3|29.91| |    3| 38.0| |    4| 40.0| |    5|120.0| |    6| 50.0| |    6|112.0| |    8| 60.0| |    8|127.0| |    8|120.0| |    9|130.0| +-----+-----+
  44. 44. CONFIDENTIAL © 2018 Code sample #1.4 - use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  45. 45. CONFIDENTIAL © 2018 Highlight anomalies +-----+-----+------------+ |guest|price|price_no_min| +-----+-----+------------+ |    1| 23.1|        23.1| |    2| 30.0|        30.0| |    2| 33.0|        33.0| |    3| 34.0|        34.0| |   24|142.0|       142.0| |   24|138.0|       138.0| |   25|  3.0|        -1.0| |   26| 10.0|        -1.0| |   25| 15.0|        -1.0| |   26|  4.0|        -1.0| |   28| 10.0|        -1.0| |   28|158.0|       158.0| |   30|170.0|       170.0| |   31|180.0|       180.0| +-----+-----+------------+
  46. 46. CONFIDENTIAL © 2018 Code sample #1.4 - use String filename = "data/dataset.csv"; Dataset<Row> df = spark.read().format("csv") .option("inferSchema", "true").option("header", "false") .load(filename); df = df.withColumn("guest", df.col("_c0")).drop("_c0"); df = df.withColumn("price", df.col("_c1")).drop("_c1"); df = df.withColumn( "price_no_min", callUDF("minimumPriceRule", df.col("price"))); df.createOrReplaceTempView("price"); df = spark.sql("SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0"); /jgperrin/net.jgp.labs.sparkdq4ml
  47. 47. CONFIDENTIAL © 2018 Cleansed dataset +-----+-----+ |guest|price| +-----+-----+ |    1| 23.1| |    2| 30.0| |    2| 33.0| |    3| 34.0| |    3| 30.0| |    4| 40.0| |   19|110.0| |   20|120.0| |   22|131.0| |   24|142.0| |   24|138.0| |   28|158.0| |   30|170.0| |   31|180.0| +-----+-----+
  48. 48. CONFIDENTIAL © 2018 Data can now be used for ML • Convert/Adapt dataset to Features and Label • Required for Linear Regression in MLlib • Needs a column called label of type double • Needs a column called features of type VectorUDT
  49. 49. CONFIDENTIAL © 2018 Code sample #2 - register & use spark.udf().register( "vectorBuilder", new VectorBuilder(), new VectorUDT()); df = df.withColumn("label", df.col("price")); df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest"))); 
 // ... Lots of complex ML code goes here ... double p = model.predict(features); System.out.println("Prediction for " + feature + " guests is " + p); /jgperrin/net.jgp.labs.sparkdq4ml
  50. 50. CONFIDENTIAL © 2018 Prediction for 40 guests… +-----+-----+-----+--------+------------------+ |guest|price|label|features|        prediction| +-----+-----+-----+--------+------------------+ |    1| 23.1| 23.1|   [1.0]|24.563807596513133| |    2| 30.0| 30.0|   [2.0]|29.595283312577884| |    2| 33.0| 33.0|   [2.0]|29.595283312577884| |    3| 34.0| 34.0|   [3.0]| 34.62675902864264| |    3| 30.0| 30.0|   [3.0]| 34.62675902864264| |    3| 38.0| 38.0|   [3.0]| 34.62675902864264| |    4| 40.0| 40.0|   [4.0]| 39.65823474470739| |   14| 89.0| 89.0|  [14.0]| 89.97299190535493| |   16|102.0|102.0|  [16.0]|100.03594333748444| |   20|120.0|120.0|  [20.0]|120.16184620174346| |   22|131.0|131.0|  [22.0]|130.22479763387295| |   24|142.0|142.0|  [24.0]|140.28774906600245| +-----+-----+-----+--------+------------------+ Prediction for 40.0 guests is 220.79136052303852
  51. 51. CONFIDENTIAL © 2018 (the complex ML code) LinearRegression lr = new LinearRegression() .setMaxIter(40) .setRegParam(1) .setElasticNetParam(1); LinearRegressionModel model = lr.fit(df); Double feature = 40.0; Vector features = Vectors.dense(40.0); double p = model.predict(features); /jgperrin/net.jgp.labs.sparkdq4ml Define algorithms and its (hyper)parameters Created a model from our data Apply the model to a new dataset: predict
  52. 52. CONFIDENTIAL © 2018 Surprise!
  53. 53. CONFIDENTIAL © 2018 Using Spark to analyse the World Cup 2018 ScoreStatisticsApp: Counting the goals and other statistics from World Cup 2018 Goals by country +-------+----------+ |country|sum(score)| +-------+----------+ |Belgium| 16| | France| 14| |Croatia| 14| |England| 12| | Russia| 11| +-------+----------+ only showing top 5 rows /jgperrin/net.jgp.labs.spark.football
  54. 54. CONFIDENTIAL © 2018 Using Spark to analyse all the World Cups HistoricScoreStatisticsApp: Analyzing the attendance for all World Cups and other stats Attendance per year +----+----------+ |Year|Attendance| +----+----------+ |1930| 1181098| |1934| 726000| |1938| 751400| |1950| 2090492| |1954| 1537214| |1958| 1639620| /jgperrin/net.jgp.labs.spark.football
  55. 55. CONFIDENTIAL © 2018 Demo Using Spark to predict the next World Cup winner • Analysis of soccer tournaments during World Cup from 1930 to 2018 • Numerous data set from Kaggle, data.world • Potentially billions of data points • Heavy usage of Java /jgperrin/net.jgp.labs.spark.football
  56. 56. CONFIDENTIAL © 2018 Conclusion
  57. 57. CONFIDENTIAL © 2018 Key takeaways • Build your data lake in memory (and disk). • Store first, ask questions later. • Sample your data (to spot anomalies and more). • Build (and reuse!) business rules with the business people. • Use Spark dataframe, SQL, and UDFs to build a consistent & coherent dataset. • Rely on UDFs for prepping ML formats. • Use Java.
  58. 58. CONFIDENTIAL © 2018 Going further • Contact me @jgperrin • Hands-on tutorial at All Things Open (October in Raleigh, NC) • Join the Spark User mailing list • Get help from Stack Overflow • fb.com/TriangleSpark • Buy my book on Spark with Java in MEAP (ok, really shameless plug here)
  59. 59. CONFIDENTIAL © 2018 Going even further Spark with Java (MEAP) by Jean Georges Perrin (@jgperrin) published by Manning https://www.manning.com/books/spark-with-java sparkwjava-B108 sparkwithjava One free book 40% off

×