Fast data and Streaming analytics



In the past decade, a number of technologies have revolutionized the way we do analytics in banking. In this talk we summarize this journey from classical offline statistical modeling to the latest real-time streaming predictive analytics techniques. In particular, we will look at Hadoop and how this distributed computing paradigm has evolved with the advent of in-memory computing and distributed machine learning using Spark. Finally, we will describe how to make data science actionable and how to overcome some of the limitations of current batch processing with streaming analytics.

We are living through the big data revolution. But what about fast data? Analytics on recent data is becoming increasingly relevant, since it provides better insight and better models in a world of rapidly changing trends and conditions.

Streaming analytics makes it possible to compute and process data and events as soon as they enter the data system, providing unprecedented levels of reactiveness. Customers enjoy live, personalized information streams. Companies can be more effective in marketing, security, operational excellence and business process management.
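
As a rough illustration of processing events as soon as they arrive, here is a minimal Python sketch of a per-customer sliding-window counter. The class name, the 60-second window and the sample events are assumptions made for this example only; they do not come from the talk, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` seconds (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp):
        """Process one event on arrival and return the key's current count."""
        window = self.events[key]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while window and window[0] <= timestamp - self.window_seconds:
            window.popleft()
        return len(window)


counter = SlidingWindowCounter(window_seconds=60)
# Made-up (customer, timestamp in seconds) events, already in time order.
stream = [("alice", 0), ("alice", 10), ("bob", 20), ("alice", 65), ("alice", 75)]
counts = [counter.on_event(key, ts) for key, ts in stream]
print(counts)
```

Each call updates the result immediately, so the count is available as soon as the event is seen, without waiting for a batch job.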

In this talk we will start from traditional batch processes, touch upon the latest developments in big data and Hadoop, and then move further into the world of fast-moving data.

We will explore some of the systems and tools used in streaming analytics, such as Spark, Samza, Kafka and Akka, and describe some typical IT architectures and data processing approaches for streaming data. Finally, we will look at how to combine streaming data with an existing batch, offline analytical solution.
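
One common way to combine streaming data with an existing batch solution is to keep a precomputed batch view and add a small running aggregate for events that arrived after the last batch run (often described as a lambda-style architecture). The Python sketch below shows that idea only; the function names, the `SpeedLayer` class and the sample amounts are hypothetical and are not taken from the talk.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Offline job: total amount per customer over the historical data."""
    view = Counter()
    for customer, amount in historical_events:
        view[customer] += amount
    return view


class SpeedLayer:
    """Running totals for events that arrived after the last batch run (hypothetical)."""

    def __init__(self):
        self.recent = Counter()

    def on_event(self, customer, amount):
        self.recent[customer] += amount


def query(batch_view, speed_layer, customer):
    """Serve a total that merges the batch view with the recent increments."""
    return batch_view[customer] + speed_layer.recent[customer]


# Made-up data: the batch view is rebuilt periodically, the speed layer fills the gap.
batch_view = build_batch_view([("alice", 100), ("bob", 50), ("alice", 25)])
speed = SpeedLayer()
speed.on_event("alice", 10)
speed.on_event("carol", 5)

print(query(batch_view, speed, "alice"))
print(query(batch_view, speed, "carol"))
```

In a real system the speed layer would be reset whenever a new batch view that already includes those events is published; that step is left out here for brevity.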

Presented at Big Data & Analytics Innovation Summit
The Innovation Enterprise, November 11 & 12, London, 2015

Published in: Data & Analytics

Fast data and Streaming analytics

  1. Fast Data and Streaming Analytics: The Evolution of Data Analytics. Natalino Busa, Enterprise Data Architect at ING
  2. @natbusa | Natalino Busa, ING group
  3. @natbusa | linkedin: Natalino Busa. ING group: Clear and Easy; Anytime, Anywhere; Empower; Keep getting better
  4. About: how to grok data with machines and keep up with changing times & techs
  5. Analytics goes mainstream (70s, 80s) ● The Relational Database is born! ● 1972: E.F. Codd, relational database model and normalization (free from insertion, deletion and update anomalies) ● 1978: Peter Chen, the entity-relationship model
  6. Analytics goes mainstream (70s, 80s): Exploratory Data Analysis. In 1977, Tukey published Exploratory Data Analysis, arguing that more emphasis needed to be placed on using data to suggest hypotheses to test and that Exploratory Data Analysis and Confirmatory Data Analysis “can—and should—proceed side by side.”
  7. Internet goes Global (90s) ● 1995: Amazon ● 1995: eBay ● 1996: Hotmail ● 1998: Google ● 1998: PayPal
  8. Knowledge Discovery in Databases (1996)
  9. The internet goes global (90s) ● Analytics (OLAP): long queries, aggregations, data mining, reporting, models ● Operations (OLTP): fast transactions, ACID, consistent, available, fault-tolerant
  10. The World goes Social (00s): web apps go into hyper-growth ● 2003: LinkedIn ● 2003: Skype ● 2004: Facebook ● 2006: Twitter
  11. More users, events, transactions.
  12. The Advent of MPP OLAPs (Early 00s) ● Massive multi-rack systems ● 100s of computing cores ● 100s of terabytes of storage ● Distributed computing ● Advanced query plans ● Columnar data models ● Re-programmable hardware
  13. Map-Reduce and Hadoop (Early 00s) ● Simpler programming paradigm ● Distributed, replicated file system
  14. Map-Reduce and Hadoop (Early 00s)
  15. Hadoop or MPPs or both?
  16. Hadoop and MPPs (00s) ● MPP for speed and accuracy, well structured data ● Hadoop for size, flexibility, raw files. Diagram from:
  17. The Rise of the Data Scientist (00s)
  18. Fast Data, API, Mobile and IoT (10s) ● WhatsApp, in a day: ● 31 billion messages sent ● 700 million photos sent
  19. Stream Centric Architectures (10s)
  20. Stream Centric Architectures (10s)
  21. Stream Centric Architectures (10s)
  22. Stream Centric Architectures (10s) ● Streaming events ● Resilient, scalable ● Publisher and subscribers ● Distributed queue
  23. Fast Data, API, Mobile and IoT (10s). New problems: ● Hadoop is getting too slow (File -> File) ● Productivity of data science goes down ● SQL is not enough ● Distributed machine learning algorithms?
  24. Streaming and Real-Time Analytics (10s): Customer Journey Analytics. Chart axes: time (10 yrs, 5 yrs, 1 yr, 1 month, 1 day, 1 hour, 1 min) versus population (events, transactions, sessions, customers, etc.). Regions: recent data (streaming analytics), historical data (big data)
  25. Diagram labels: Distributed Data Store; Fast Analytics; Event Processing; Real Time APIs; Streaming Data; Data Modeling; Data Sources, Files, DB extracts; Batched Data; Alerts and Notifications; API for mobile and web. Training, Scoring and Exposing models: read the model, read the data, write the model
  26. The RAM is the new Disk (10s): in-memory computing is winning! Spark is emerging as an improved, faster, better, “new” Hadoop.
  27. Integrated Data Science (10s). Unified distributed computing paradigm: SQL, statistics, machine learning, graph analytics. Polyglot programming: R, Python, Scala, Java
  28. Spark: Hadoop evolved (10s). Diagram labels: Spark (Streaming, SQL, MLlib, GraphX); Analytics, Statistics, Data Science, Model Training; HDFS, NoSQL, SQL; Data Sources; Map-Reduce; HDFS; Kafka
  29. Popular Operational Analytics Stacks (10s) ● Kafka + Spark + Cassandra + Akka (NoSQL stack, Fast Data) ● MPP + HDFS + Spark (“new” Hadoop / Data Lake)
  30. Kings and new entries (10s, 20s): Micro-Batch and Event Streaming Analytics - Micro-Batch (Spark Streaming) - Log Oriented (Kafka, Samza) - NewSQL (VoltDB) - Streaming computing (MillWheel, Flink, Apex)
  31. Data Science new trends (10s) - Deep Learning
  32. Data Science new trends (10s): Deep Learning to assist doctors treating and classifying cancer
  33. Data Science new trends (10s) - Deep Learning: DL4J, Theano, TensorFlow
  34. Data Science new trends (10s) - Topological Data Analysis: analyze high-dimensional data, visually. Analysis of the Netflix Prize dataset. Data set statistics: ● 100,480,507 ratings ● 480,189 users ● 17,770 movies ● 2.8 GB CSV file size
  35. Takeaways: Data Science 1) SQL + Machine Learning 2) Diversity in your team: great asset 3) Data science: R-Scala-Python-Java
  36. Takeaways: Techs 1) Memory is King 2) The “Event Stream” 3) Spark is the new Hadoop
  37. Takeaways: Customer’s Journey. It starts and ends with people. Value the experience, not the tools.
  38. Distributed computing, Machine Learning, Statistics, Big/Fast Data, Streaming Computing
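
The "publisher and subscribers" idea from slide 22 and the micro-batch versus event-streaming distinction from slide 30 can be sketched in a few lines of Python. This is an in-memory toy, not the Kafka or Spark Streaming API: the `EventLog` class, the subscriber names and the sample values are all hypothetical, and micro-batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
class EventLog:
    """In-memory, append-only log; each subscriber tracks its own read offset."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # subscriber name -> index of the next record to read

    def publish(self, record):
        self.records.append(record)

    def poll(self, subscriber, max_records):
        """Return up to `max_records` unread records and advance the offset."""
        start = self.offsets.get(subscriber, 0)
        batch = self.records[start:start + max_records]
        self.offsets[subscriber] = start + len(batch)
        return batch


log = EventLog()
for amount in [5, 7, 11, 13, 17]:  # made-up event payloads
    log.publish(amount)

# Event-at-a-time subscriber: updates its result after every single record.
running_total = 0
per_event_totals = []
while True:
    batch = log.poll("event_consumer", max_records=1)
    if not batch:
        break
    running_total += batch[0]
    per_event_totals.append(running_total)

# Micro-batch subscriber: reads the same log independently, two records at a time.
micro_batch_sums = []
while True:
    batch = log.poll("micro_batch_consumer", max_records=2)
    if not batch:
        break
    micro_batch_sums.append(sum(batch))

print(per_event_totals)
print(micro_batch_sums)
```

Because each subscriber keeps its own offset, both consumers see every record without interfering with each other, which is the property that lets batch and streaming jobs share one event log.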