Big Data
Trends, Challenges, and Opportunities
Mohammed Guller
Jan 30, 2015

Deck presented at the Georgia University. I was invited to give a talk on Big Data and related technologies, including Spark.

  1. Big Data: Trends, Challenges, and Opportunities
     Mohammed Guller, Jan 30, 2015
  2. About Me
     • Principal Architect at Glassbeam
     • Founded two startups
     • Passionate about building products, big data analytics, and machine learning
     • www.linkedin.com/in/mohammedguller, @MohammedGuller
     • Available on Amazon
  3. Functional Programming
  4. CPU Trend
     • CPU clock speed plateaued around 2004
     • Individual CPU cores are not getting any faster
     • Trend is to add more cores per CPU and more CPUs per system
  5. Challenges
     • Multi-threaded programs are required to utilize all cores in a machine
     • Writing multi-threaded programs is hard
     • Tools provided by traditional languages are primitive
     • Problems such as deadlocks, livelocks, starvation, and race conditions are difficult to avoid and detect
  6. Functional Programming (FP)
     • Based on theory developed in the 1930s
     • Program composed of functions
       – Executed by evaluating expressions
     • Functions are first-class citizens
       – Can be passed as an argument to another function
       – Can be returned by another function
       – Can be defined inside another function
       – Can be defined as an unnamed literal, similar to a string literal
     • Functions do not have side effects
       – Always return the same output for a given input
       – Order of execution is not important
     • Discourages mutable variables
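The properties listed on this slide can be sketched in a few lines. The example below uses Python rather than one of the FP languages named in the deck, purely for illustration; the names `compose`, `square`, and `add_one` are hypothetical and do not come from the slides.

```python
from functools import reduce


def compose(f, g):
    """Take two functions as arguments and return a new function."""
    def composed(x):  # a function defined inside another function
        return f(g(x))
    return composed   # a function returned by another function


def square(x):
    # No side effects: the result depends only on the input.
    return x * x


add_one = lambda x: x + 1  # an unnamed function literal bound to a name

square_then_add_one = compose(add_one, square)

numbers = (1, 2, 3, 4)                  # immutable tuple instead of a mutable list
squares = tuple(map(square, numbers))   # a function passed as an argument
total = reduce(lambda acc, x: acc + x, squares, 0)

print(square_then_add_one(3), squares, total)
```

Because `square` and `add_one` never modify shared state, they can be evaluated in any order without changing the result.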
  7. Benefits of Functional Programming
     • Makes it easier to write multi-threaded programs
     • Improves developer productivity
     • Enables better quality code
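A minimal sketch of the first benefit, assuming a hypothetical side-effect-free function `word_count` and made-up input lines. Because the function touches no shared state, it can be handed to a thread pool without locks, and the parallel result matches the sequential one. (In CPython the global interpreter lock limits the speedup for CPU-bound work; the sketch shows correctness without locking, not performance.)

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line):
    # Pure function: reads only its argument and writes no shared state.
    return len(line.split())


# Hypothetical input, for illustration only.
lines = [
    "big data is big",
    "spark is fast",
    "functions have no side effects",
]

# No locks are needed because the worker function has no side effects.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_counts = list(pool.map(word_count, lines))

sequential_counts = [word_count(line) for line in lines]

print(parallel_counts == sequential_counts)
```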
  8. Functional Programming Languages
     • Lisp
     • Erlang
     • Haskell
     • Scala
     • Swift
  9. Opportunities
     • High demand for people who know Scala
       – Scala is one of the most popular FP languages
     • Shortage of people who know Scala
  10. Big Data
  11. 3 Vs of Big Data
     • Volume: Scale of Data
     • Variety: Diversity of Data
     • Velocity: Speed of Data
  12. Amount of Data Generated is Exploding
  13. 5x More Connected Things Than People by 2020
     • Network of objects embedded with software for collecting and exchanging data over the Internet
  14. Big Data Challenges
     • Storage
       – Traditional SAN and NAS storage devices are expensive
     • Processing
       – Traditional RDBMS were not designed to handle big data
     • How to get value out of data
     • How to do it economically
  15. Open-source Big Data Storage Technologies
     • Distributed File Systems
       – HDFS
     • NoSQL data stores
       – Cassandra
       – HBase
       – MongoDB
       – Druid
       – ElasticSearch
       – SolrCloud
  16. How Much Data Can a Standard Server Process
     (Options shown on the slide: 100 GB, 1 TB, 10 TB, 100 TB)
  17. Options For Increasing Data Processing Power
     • Scale-up
     • Scale-out
  18. Scale-up
     • Use a more powerful high-end server
       – Faster CPU
       – Faster Disk
       – Large number of CPUs
       – Large amount of memory
     • Proprietary
     • Expensive
     • Limited scalability
  19. Scale-out
     • Use a cluster of commodity servers
     • Inexpensive
     • Economical to scale
     • Preferred architecture
  20. Challenges With Scale-out Architecture
     • Writing a distributed application is even harder than writing a multi-threaded one
     • Many details involved
       – Split a workload into chunks that can be distributed across a cluster
       – Schedule compute resources among different jobs
       – Manage inter-node communication
       – Handle network and node failures
     • Hardware failures are more common at a cluster level
       – Probability of a single node failing is very low
       – Probability of at least one node failing in a cluster of thousands of nodes is very high
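The last point can be checked with a short calculation. The sketch below assumes node failures are independent and uses a made-up per-node failure probability of 0.1% over some fixed period; neither the number nor the independence assumption comes from the slides. If each node fails with probability p, the chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_node_fails(per_node_failure_prob, node_count):
    """Probability that at least one of node_count independent nodes fails."""
    prob_all_survive = (1 - per_node_failure_prob) ** node_count
    return 1 - prob_all_survive


# Hypothetical figure for illustration: each node fails with probability 0.1%.
P_NODE = 0.001

single_node = prob_any_node_fails(P_NODE, 1)
cluster_1000 = prob_any_node_fails(P_NODE, 1_000)
cluster_5000 = prob_any_node_fails(P_NODE, 5_000)

for nodes, prob in [(1, single_node), (1_000, cluster_1000), (5_000, cluster_5000)]:
    print(f"{nodes:>5} nodes: {prob:.1%} chance of at least one failure")
```

Under these assumptions a thousand-node cluster sees at least one failure in well over half of such periods, which is why a distributed framework has to treat node failure as routine.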
  21. Getting Value Out of Data
     • Traditional analytics / BI
     • Machine Learning
       – Predictive analytics
       – Train software to do human tasks
  22. Traditional Analytics / BI
     • What happened
       – Revenue growth for the last month/quarter/year
       – Customer growth for the last month/quarter/year
     • Why it happened
       – Why profit dropped
       – Why sales dropped
     • Other insights
       – What is the country-wise breakdown of people downloading an app
       – How much time people spend in an app
  23. Predictive Analytics
     • Ask software to predict
       – What product will a customer most likely buy
       – What ad will a visitor most likely click
       – What movies/songs/books will a customer like
       – What are the chances that a patient may have a heart attack
     • More interesting and valuable than traditional analytics
  24. Train Software To Do Human Tasks
     • Image classification
       – Facebook
       – Flickr
     • Voice recognition and natural language processing
       – Siri
     • Body movement recognition
       – Xbox Kinect
     • Self-driving car
       – Google car
     • Medical diagnosis
     • Anomaly detection
       – Fraudulent transaction
       – Security attack
  25. Distributed Data Processing Frameworks
     • Batch processing
       – MapReduce
     • Stream processing
       – Samza
       – Heron
       – Storm
     • Batch and stream processing
       – Spark
       – Flink
       – Apex
  26. Spark
     • Fast, easy-to-use, and general-purpose cluster computing framework for processing large datasets
  27. Supports a Variety of Data Sources
  28. Spark Benefits
     • Makes it easy to write distributed data processing applications
       – Expressive API
     • Takes care of the messy details of distributed computing
     • Allows developers to just focus on the business logic
       – Same code works on a single computer or a cluster of nodes
  29. Integrated Libraries for a Variety of Tasks
     • Spark Core
     • Spark SQL
     • GraphX
     • Spark Streaming
     • MLlib & Spark ML
  30. Spark is Fast
     • In-memory computation
     • Advanced Directed Acyclic Graph (DAG) execution engine
  31. Why In-memory Computation Matters
     • Throughput figures shown: 100 MB/s, 500 MB/s, 10 GB/s
  32. Read Time Comparison
     (Chart: time in minutes, on an axis from 0 to 200, to read 1 TB of data from HDD, SSD, and RAM)
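The chart can be roughly reproduced with simple arithmetic. The sketch below assumes that the three throughput figures on the previous slide map to HDD, SSD, and RAM in that order, and that 1 TB means 1,000,000 MB; both are assumptions, since the slide images did not survive in the transcript.

```python
# Assumed mapping of the slide's throughput figures to devices, in MB/s.
THROUGHPUT_MB_PER_S = {"HDD": 100, "SSD": 500, "RAM": 10_000}

DATA_MB = 1_000_000  # 1 TB expressed in MB (decimal units, an assumption)


def read_time_minutes(data_mb, mb_per_s):
    """Time to read data_mb sequentially at a sustained mb_per_s, in minutes."""
    return data_mb / mb_per_s / 60


read_times = {
    device: read_time_minutes(DATA_MB, rate)
    for device, rate in THROUGHPUT_MB_PER_S.items()
}

for device, minutes in read_times.items():
    print(f"{device}: {minutes:.1f} min")
```

Under these assumptions, reading 1 TB takes close to three hours from an HDD, about half an hour from an SSD, and under two minutes from RAM, which is consistent with the chart's 0 to 200 minute axis.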
  33. What Are People Using Spark For
     (Source: Databricks Survey 2015)
  34. Top Reasons For Using Spark
     (Source: Databricks Survey 2015)
  35. Adoption of Spark is Growing Rapidly
  36. Opportunities
     • Big data will only get bigger
       – Everything will be data driven
       – New data-driven applications will be invented
       – Data will enable us to solve extremely difficult problems
     • Spark and other big data technologies are rapidly evolving
     • Strong demand for people who know how to store, process, and get value out of big data