Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing, presented by Sampat Kumar of Harman at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
2. YOUR PRESENTER – SAMPAT KUMAR BUDANKAYALA
• Sr. Big Data Analyst @ Harman Solutions
• Over 4.5 years of Big Data experience across 15-20 projects.
• Specialist in building data lake projects, data security, streaming
solutions (real-time ingestion), linear regression, and recommendation
systems.
• Email: sampatbigdata@gmail.com
• LinkedIn:
3. AGENDA
• Around the Globe (Spark and Hadoop)
• Big Data, Big Data Stack, Apache Hadoop, Apache Spark.
• What is Hadoop and what is Spark?
• Spark vs Hadoop and the combination effect.
• Q & A
4. Around the Globe:
NEWS:
----------
• Is it Spark ‘vs’ Hadoop, or Spark ‘and’ Hadoop?
• Apache Spark adoption is continuing to grow beyond Apache Hadoop.
SURVEYS:
--------------
• Big Data, the analysis of large quantities of data to gain new insight, has become a ubiquitous phrase in
recent years. Day by day, data is growing at a staggering rate. One of the technologies that deals
efficiently with Big Data is Hadoop.
• Hadoop uses the MapReduce programming model to process jobs over large data volumes.
http://www.ijetae.com/files/Volume4Issue5/IJETAE_0514_15.pdf
• Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an
appetite for more flexible developer tools to support the larger market of 'mid-size' datasets and use
cases that call for real-time processing.
http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-require-1986162.htm
6. Big Data, Big Data Stack, Apache Spark and Hadoop
Big Data
----------
• Big data is a term that describes large volumes of data: structured, semi-structured and unstructured.
• But it’s not the amount of data that’s important; it’s what organizations do with the data that matters. Big data can
be analyzed for insights that lead to better decisions and strategic business moves.
• The concept gained momentum in the early 2000s when industry analysts articulated the now-mainstream
definition of big data as the three Vs:
Volume – Organizations collect data from a variety of sources, including business transactions, social media and
sensor or machine-to-machine data. In the past, storing it would have been a problem, but new
technologies (such as Hadoop) have eased the burden.
Velocity – Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and
smart metering are driving the need to deal with torrents of data in near-real time.
Variety – Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text
documents, email, video, audio, stock ticker data and financial transactions.
https://www.zettaset.com/index.php/info-center/what-is-big-data/
7. Big Data, Big Data Stack, Apache Spark and Hadoop
Big Data Stack
-------------------
8. Big Data, Big Data Stack, Apache Spark and Hadoop
Apache Hadoop
---------------------
• Hadoop is a framework designed to work with data sets that are orders of magnitude larger than normal
systems can handle.
• Hadoop distributes this data across a set of machines. The real power of Hadoop comes from its ability
to scale to hundreds or thousands of computers, each containing several processor cores.
• Many big enterprises believe that within a few years more than half of the world’s data will be stored in Hadoop.
• Hadoop mainly consists of:
1. Hadoop Distributed File System (HDFS): a distributed file system that provides storage and fault tolerance.
2. Hadoop MapReduce: a parallel programming model that processes vast quantities of data via
distributed computing across the cluster (see the word-count sketch below).
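To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets any executable act as mapper and reducer over stdin/stdout; the script names and data paths are hypothetical, not from the slides.

# mapper.py - reads input lines from stdin, emits one tab-separated
# (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - Hadoop sorts mapper output by key, so equal words arrive
# together; sum the counts per word and emit the totals
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))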
9. Big Data, Big Data Stack, Apache Spark and Hadoop
Apache Spark
---------------------
• Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to
allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast
iterative access to datasets.
• Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the
Java, Scala, and Python APIs offer a platform for distributed ETL application development.
• Spark is designed for data science and its abstraction makes data science easier. Data scientists commonly use
machine learning – a set of techniques and algorithms that can learn from data.
• Spark claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
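For comparison with the MapReduce sketch above, the same word count in PySpark takes a few lines; the HDFS paths are hypothetical.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")           # connect to the cluster
counts = (sc.textFile("hdfs:///data/input.txt")  # hypothetical input path
            .flatMap(lambda line: line.split())  # line -> words
            .map(lambda word: (word, 1))         # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))    # sum counts per word
counts.saveAsTextFile("hdfs:///data/output")     # hypothetical output path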
10. Spark vs Hadoop and the combination effect
Performance
-----------------
• Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or
reduce action, so Spark should outperform Hadoop MapReduce.
• Nonetheless, Spark needs a lot of memory. Much like standard databases, it loads data into memory and keeps it
there until further notice, for the sake of caching.
• If Spark runs on Hadoop YARN with other resource-demanding services, or if the data is too big to fit entirely into
the memory, then there could be major performance degradations for Spark.
• MapReduce, however, kills its processes as soon as a job is done, so it can easily run alongside other services with
minor performance differences.
• Bottom line: Spark performs better when all the data fits in the memory, especially on dedicated clusters; Hadoop
MapReduce is designed for data that doesn’t fit in the memory and it can run well alongside other services.
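One way to see this trade-off in code: Spark lets you choose a storage level per dataset, so data that does not fit in RAM can spill to disk instead of being recomputed. A minimal sketch, assuming a hypothetical input path:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CacheDemo")
data = sc.textFile("hdfs:///data/big_input")   # hypothetical path

# MEMORY_AND_DISK keeps partitions in RAM when possible and spills the
# rest to disk, trading disk I/O for recomputation.
data.persist(StorageLevel.MEMORY_AND_DISK)
data.count()   # first action materializes the cache
data.count()   # second action is served from the cache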
11. Spark vs Hadoop and the combination effect
Ease of Use:
-----------------
• Spark has comfortable APIs for Java, Scala and Python, and also includes Spark SQL (formerly known as Shark) for
the SQL savvy.
• Hadoop MapReduce is written in Java and is infamous for being very difficult to program. Pig makes it easier,
though it requires some time to learn the syntax, and Hive adds SQL compatibility to the plate.
• MapReduce doesn’t have an interactive mode, although Hive includes a command line interface. Projects like
Impala, Presto and Tez want to bring full interactive querying to Hadoop.
Bottom line: Spark is easier to program and includes an interactive mode; Hadoop MapReduce is more difficult to
program but many tools are available to make it easier.
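The interactive mode mentioned above looks like this in practice: in the PySpark shell (started with the pyspark command) a SparkContext named sc is predefined, and each expression is evaluated immediately. The log path is hypothetical.

>>> lines = sc.textFile("hdfs:///data/logs")       # hypothetical path
>>> errors = lines.filter(lambda l: "ERROR" in l)  # lazy transformation
>>> errors.count()                                 # action: runs the job now
>>> errors.take(3)                                 # inspect a few results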
12. Spark vs Hadoop and the combination effect
Cost:
-----------------
• Both Spark and Hadoop MapReduce are open source, but money still needs to be spent on machines and staff.
• Hardware Requirements.
• The memory in the Spark cluster should be at least as large as the amount of data you need to process, because
the data has to fit into the memory for optimal performance. So, if you need to process really Big Data, Hadoop
will definitely be the cheaper option since hard disk space comes at a much lower rate than memory space.
• Furthermore, there is a wide array of Hadoop-as-a-service offerings and Hadoop-based services, which help to skip the
hardware and staffing requirements. In comparison, there are few Spark-as-a-service options, and they are all very
new.
• Bottom line: Spark is more cost-effective according to the benchmarks, though staffing could be more costly;
Hadoop MapReduce could be cheaper because more personnel are available and because of Hadoop-as-a-service
offerings.
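A back-of-the-envelope sketch of the memory-versus-disk argument above; the per-GB prices are hypothetical placeholders, not real quotes, and only the ratio matters.

# Rough cost comparison of holding a working set in RAM vs on disk.
data_gb = 10 * 1024          # hypothetical 10 TB working set
ram_cost_per_gb = 5.00       # hypothetical $/GB of cluster RAM
disk_cost_per_gb = 0.05      # hypothetical $/GB of disk

print("RAM-resident:  $%.0f" % (data_gb * ram_cost_per_gb))
print("Disk-resident: $%.0f" % (data_gb * disk_cost_per_gb))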
13. Spark vs Hadoop and the combination effect
Data Processing:
----------------------
• Apache Spark can do more than plain data processing: it can process graphs and use the existing machine-learning
libraries.
• Spark can do real-time processing as well as batch processing.
• Hadoop MapReduce is great for batch processing. If you want a real-time option you’ll need to use another
platform like Storm or Impala, and for graph processing you can use Giraph. MapReduce used to have Apache
Mahout for machine learning, but the elephant riders have ditched it in favor of Spark and H2O.
• Bottom line: Spark is the key for real-time data processing; Hadoop MapReduce is the key for batch processing.
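To illustrate the real-time side, here is a minimal Spark Streaming sketch (the DStream API): the same transformations as the batch word count above, applied to 5-second micro-batches read from a socket. The host and port are hypothetical.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 5)                    # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's counts

ssc.start()
ssc.awaitTermination()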
14. Spark vs Hadoop and the combination effect
Failure Tolerance:
----------------------
• Spark has per-task retries and speculative execution, just like MapReduce. Nonetheless, because MapReduce
persists intermediate results to hard drives, a job that crashes mid-execution can continue where it left off, whereas
Spark has to start processing from the beginning. That disk persistence can save recovery time.
• Bottom line: Spark and Hadoop MapReduce both have good failure tolerance, but Hadoop MapReduce is
slightly more tolerant.
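Spark can narrow this gap with checkpointing: writing an RDD to reliable storage truncates its lineage, so recovery does not have to recompute from the very beginning. A minimal sketch with hypothetical paths:

from pyspark import SparkContext

sc = SparkContext(appName="CheckpointDemo")
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # hypothetical directory

rdd = sc.textFile("hdfs:///data/input").map(lambda l: l.upper())
rdd.checkpoint()   # saved to reliable storage at the next action
rdd.count()        # triggers computation and the checkpoint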
Security:
------------------
• Spark is a bit bare at the moment when it comes to security.
• Spark can run on YARN and use HDFS, which means that it can also enjoy Kerberos authentication, HDFS file
permissions and encryption between nodes.
• Hadoop MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like
Knox Gateway and Sentry.
• Bottom line: Spark security is still in its infancy; Hadoop MapReduce has more security features and projects.