Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

New trends in big data: in-memory analytics, streaming computing and distributed machine learning


Published on

New trends in Big Data:
In-memory analytics, streaming computing and distributed machine learning

The ability to understand data is a central ingredient to solve many real-world problems such as customer churn, cost optimization and fraud detection. Data processing is usually arranged as a pipeline made of several steps such as data extraction, data preparation, training and scoring. Big data technologies can be used to parallelise the whole pipeline, hence providing resilient, scalable and distributed data models.

Since the introduction of Hadoop 10 years back, memory and SSD technologies have become cheaper and more accessible. What was possible back in the days on disk is today possible on solid state technology. Apache Spark makes use of in-memory distributed data structures to accelerate data analytics.

Originally, Big Data and in particular Hadoop was designed to operate on large chunks of data (terabytes and petabytes). However, it's does not perform well to process single events or to provide low-latency actionable results. Streaming computing attempts to provide a single data processing paradigm which is works well for both "fast" and "big" data.

Published in: Data & Analytics

New trends in big data: in-memory analytics, streaming computing and distributed machine learning

  1. 1. Trends in Big Data. Natalino Busa Data Platform Architect at Ing
  2. 2. Play with your phones
  3. 3. Re-think Big Data Hadoop has turned 10
  4. 4. Memory is eating Big Data Amazon is delivering instances with 2 TB RAM Facebook, Microsoft: 90% workload below the 100 GB Machine Learning algorithms fit on a single node
  5. 5. 250 MB hard disk drive from 1979 I like Big Data and I cannot lie.
  6. 6. Disk -> RAM Hadoop -> Spark Map-Reduce -> Data Flow Graphs HDFS -> Storage, MPPs, NoSQL
  7. 7. Wheel mill. Stream like a boss
  8. 8. Streaming and Real-Time Analytics
  9. 9. Batch -> Event-Driven ETL -> Streaming Hive -> Flink, Akka, Spark
  10. 10. Stream Centric Architectures
  11. 11. Spark - RDDs Streaming SQL MLlib Graphx Analytics, Statistics, Data Science, Model Training HDFS NoSQL SQL Data Sources Map-Reduce HDFS KAFKA Spark: Unified Distributed Computing: SQL + Machine Learning + Graph Analytics Hive
  12. 12. Virtual resources Big Data Applications, Assemble!
  13. 13. Clusters -> Resources Orchestrated -> Isolated Static -> Disposable YARN, MESOS, CoreOS, Kubernetes Application-oriented Infrastructure
  14. 14. Elastic: Docker, Mesos, Yarn, Kubernetes Data Processing: Flink, Spark, Akka Indexing: Elastic Search, Deep Learning APIs and microservices: Akka, Python, Java Data storage: SQL, NoSQL, HDFS, Streaming MESOS, YARN Spark Streaming SQL MLlib Graphx DBs ES C* Application Oriented Architectures
  15. 15. That’s all folks! Natalino Busa Data Platform Architect at Ing @natbusa