Big data trends challenges opportunities

Deck presented at the Georgia University. I was invited to give a talk on Big Data and related technologies, including Spark.

Published in: Data & Analytics

  1. Big Data Trends, Challenges, and Opportunities
      Mohammed Guller, Jan 30, 2015
  2. About Me
      • Principal Architect at Glassbeam
      • Founded two startups
      • Passionate about building products, big data analytics, and machine learning
      www.linkedin.com/in/mohammedguller
      @MohammedGuller
      Available on Amazon
  3. Functional Programming
  4. CPU Trend
      • CPU clock speed plateaued around 2004
      • CPUs are not getting any faster
      • Trend is to add more cores/CPU and more CPUs/system
  5. Challenges
      • Multi-threaded programs are required to utilize all cores in a machine
      • Writing multi-threaded programs is hard
      • Tools provided by traditional languages are primitive
      • Problems such as deadlocks, livelocks, starvation, and race conditions are difficult to avoid and detect
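To make the race-condition problem on this slide concrete, here is a minimal Python sketch (not from the deck). The function `run_counter` and its parameters are hypothetical names chosen for illustration. Several threads increment a shared counter; with a lock the total is always correct, while without one updates can be lost, and whether that actually happens depends on the interpreter and timing.

```python
import threading


def run_counter(num_threads: int, increments: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads and return the final value."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                with lock:          # only one thread may update at a time
                    counter += 1
            else:
                counter += 1        # unsynchronized read-modify-write

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


expected = 4 * 50_000
locked_total = run_counter(4, 50_000, use_lock=True)
unlocked_total = run_counter(4, 50_000, use_lock=False)

print("with lock:   ", locked_total, "of", expected)
print("without lock:", unlocked_total, "of", expected)
```

The unsynchronized run may happen to produce the right total on a given machine, which is part of why such bugs are hard to detect.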
  6. Functional Programming (FP)
      • Based on theory developed in the 1930s
      • Program composed of functions
        – Executed by evaluating expressions
      • Functions are first-class citizens
        – Can be passed as an argument to another function
        – Can be returned by another function
        – Can be defined inside another function
        – Can be defined as an unnamed literal, similar to a string literal
      • Functions do not have side effects
        – Always return the same output for a given input
        – Order of execution is not important
      • Discourages mutable variables
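The properties listed on this slide can be illustrated with a short Python sketch. The deck itself points to Scala; Python is used here only as an assumption for illustration, and the names `make_scaler` and `apply_twice` are hypothetical.

```python
from typing import Callable


def make_scaler(factor: int) -> Callable[[int], int]:
    """Return a function: an example of a function returned by another function."""

    def scale(x: int) -> int:       # defined inside another function
        return x * factor           # no side effects: result depends only on inputs

    return scale


def apply_twice(f: Callable[[int], int], x: int) -> int:
    """Take a function as an argument and apply it two times."""
    return f(f(x))


double = make_scaler(2)
increment = lambda x: x + 1         # unnamed function literal

print(apply_twice(double, 5))       # 20
print(apply_twice(increment, 5))    # 7

# Immutable data plus pure functions: the input tuple is never modified.
values = (1, 2, 3)
squares = tuple(map(lambda x: x * x, values))
print(values, squares)
```

Because `scale` and the lambdas touch no shared state, calling them from several threads needs no locking.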
  7. Benefits of Functional Programming
      • Makes it easier to write multi-threaded programs
      • Improves developer productivity
      • Enables better quality code
  8. Functional Programming Languages
      • Lisp
      • Erlang
      • Haskell
      • Scala
      • Swift
  9. Opportunities
      • High demand for people who know Scala
        – Scala is one of the most popular FP languages
      • Shortage of people who know Scala
  10. Big Data
  11. 3 Vs of Big Data
      • Volume: scale of data
      • Variety: diversity of data
      • Velocity: speed of data
  12. Amount of Data Generated is Exploding
  13. 5x More Connected Things Than People by 2020
      Network of objects embedded with software for collecting and exchanging data over the Internet
  14. Big Data Challenges
      • Storage
        – Traditional SAN and NAS storage devices are expensive
      • Processing
        – Traditional RDBMS were not designed to handle big data
      • How to get value out of data
      • How to do it economically
  15. Open-source Big Data Storage Technologies
      • Distributed File Systems
        – HDFS
      • NoSQL data stores
        – Cassandra
        – HBase
        – MongoDB
        – Druid
        – ElasticSearch
        – SolrCloud
  16. How Much Data Can a Standard Server Process?
      Options shown: 100 GB, 1 TB, 10 TB, 100 TB
  17. Options For Increasing Data Processing Power
      • Scale-up
      • Scale-out
  18. Scale-up
      • Use a more powerful high-end server
        – Faster CPU
        – Faster Disk
        – Large number of CPUs
        – Large amount of memory
      • Proprietary
      • Expensive
      • Limited scalability
  19. Scale-out
      • Use a cluster of commodity servers
      • Inexpensive
      • Economical to scale
      • Preferred architecture
  20. Challenges With Scale-out Architecture
      • Writing a distributed application is even harder than writing a multi-threaded one
      • Many details involved
        – Split a workload into chunks that can be distributed across a cluster
        – Schedule compute resources among different jobs
        – Manage inter-node communication
        – Handle network and node failures
      • Hardware failures are more common at a cluster level
        – Probability of a single node failing is very low
        – Probability of at least one node failing in a cluster of thousands of nodes is very high
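The failure-probability claim on this slide follows from simple arithmetic: if each node fails independently with probability p over some period, the chance that at least one of n nodes fails is 1 - (1 - p)^n. The sketch below works this out; the per-node rate of 0.1% and the function name `prob_any_failure` are assumptions for illustration, not figures from the deck.

```python
def prob_any_failure(p_single: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes independent nodes fails."""
    return 1.0 - (1.0 - p_single) ** num_nodes


P_SINGLE = 0.001  # assumed per-node failure probability over some period

for nodes in (1, 10, 100, 1000, 5000):
    print(f"{nodes:>5} nodes: {prob_any_failure(P_SINGLE, nodes):.4f}")

p_one = prob_any_failure(P_SINGLE, 1)
p_cluster = prob_any_failure(P_SINGLE, 5000)
```

Under this assumed rate, a single node almost never fails, while a cluster of 5,000 nodes sees at least one failure with probability above 99%.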
  21. Getting Value Out of Data
      • Traditional analytics / BI
      • Machine Learning
        – Predictive analytics
        – Train software to do human tasks
  22. Traditional Analytics / BI
      • What happened
        – Revenue growth for the last month/quarter/year
        – Customer growth for the last month/quarter/year
      • Why it happened
        – Why profit dropped
        – Why sales dropped
      • Other insights
        – What is the breakdown by country of people downloading an app
        – How much time people spend in an app
  23. Predictive Analytics
      • Ask software to predict
        – What product will a customer most likely buy
        – What ad will a visitor most likely click
        – What movies/songs/books will a customer like
        – What are the chances that a patient will have a heart attack
      • More interesting and valuable than traditional analytics
  24. Train Software To Do Human Tasks
      • Image classification
        – Facebook
        – Flickr
      • Voice recognition and natural language processing
        – Siri
      • Body movement recognition
        – Xbox Kinect
      • Self-driving car
        – Google car
      • Medical diagnosis
      • Anomaly detection
        – Fraudulent transaction
        – Security attack
  25. Distributed Data Processing Frameworks
      • Batch processing
        – MapReduce
      • Stream processing
        – Samza
        – Heron
        – Storm
      • Batch and stream processing
        – Spark
        – Flink
        – Apex
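The map-and-reduce pattern behind batch frameworks such as MapReduce can be sketched in plain Python on a single machine. This is an illustration only: it uses no Hadoop or Spark API, and the helper names (`split_into_chunks`, `map_chunk`, `reduce_counts`) are hypothetical. In a real cluster each chunk would be processed on a different node and the framework would handle scheduling and failures.

```python
from collections import Counter
from functools import reduce
from typing import List


def split_into_chunks(items: List[str], num_chunks: int) -> List[List[str]]:
    """Split a list into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(lines: List[str]) -> Counter:
    """Map step: count words in one chunk, independently of other chunks."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial word counts."""
    return left + right


lines = [
    "spark is fast",
    "spark is general purpose",
    "storm is a stream processor",
]

chunks = split_into_chunks(lines, 2)
partials = [map_chunk(chunk) for chunk in chunks]   # would run in parallel on a cluster
total = reduce(reduce_counts, partials, Counter())

print(total["spark"], total["is"], total["storm"])
```

Because `map_chunk` depends only on its own chunk, the partial results can be computed in any order and merged afterwards.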
  26. Spark
      Fast, easy-to-use, and general-purpose cluster computing framework for processing large datasets
  27. Supports a Variety of Data Sources
  28. Spark Benefits
      • Makes it easy to write distributed data processing applications
        – Expressive API
      • Takes care of the messy details of distributed computing
      • Allows developers to just focus on the business logic
        – Same code works on a single computer or a cluster of nodes
  29. Integrated Libraries for a Variety of Tasks
      • Spark Core
      • Spark SQL
      • GraphX
      • Spark Streaming
      • MLlib & Spark ML
  30. Spark is Fast
      • In-memory computation
      • Advanced Directed Acyclic Graph (DAG) execution engine
  31. Why In-memory Computation Matters
      Throughput figures shown: 100 MB/s, 500 MB/s, 10 GB/s
  32. Read Time Comparison
      [Chart: time in minutes to read 1 TB of data from HDD, SSD, and RAM]
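The read-time comparison can be reproduced with a quick calculation. The sketch below assumes the three throughput figures from the preceding slide (100 MB/s, 500 MB/s, 10 GB/s) correspond to HDD, SSD, and RAM in that order, and uses decimal units (1 TB = 10^12 bytes); both are assumptions, since the transcript does not label the figures.

```python
DATA_BYTES = 10**12  # 1 TB, decimal units (assumption)

# Assumed mapping of the slide's throughput figures to storage media.
throughput_bytes_per_sec = {
    "HDD": 100e6,   # 100 MB/s
    "SSD": 500e6,   # 500 MB/s
    "RAM": 10e9,    # 10 GB/s
}

# Time in minutes to read the full 1 TB sequentially from each medium.
read_minutes = {
    medium: DATA_BYTES / rate / 60
    for medium, rate in throughput_bytes_per_sec.items()
}

for medium, minutes in read_minutes.items():
    print(f"{medium}: {minutes:.1f} min")
# HDD: 166.7 min
# SSD: 33.3 min
# RAM: 1.7 min
```

Under these assumptions, reading from memory is about 100 times faster than reading from a hard disk, which is consistent with the 0 to 200 minute axis on the chart.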
  33. What Are People Using Spark For
      Source: Databricks Survey 2015
  34. Top Reasons For Using Spark
      Source: Databricks Survey 2015
  35. Adoption of Spark is Growing Rapidly
  36. Opportunities
      • Big data will only get bigger
        – Everything will be data driven
        – New data-driven applications will be invented
        – Data will enable us to solve extremely difficult problems
      • Spark and other big data technologies are rapidly evolving
      • Strong demand for people who know how to store, process and get value out of big data