
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing


Introduction To Big Data with Hadoop and Spark - For Batch and Real-Time Processing, by Sampat Kumar from Harman. The presentation was given at #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.

Published in: Technology

  1. HADOOP VS SPARK
  2. YOUR PRESENTER – SAMPAT KUMAR BUDANKAYALA • Sr. Big Data Analyst @ Harman Solutions • Over 4.5 years of Big Data experience across 15-20 projects. • Specialist in building data lake projects, data security, streaming solutions (real-time ingestion), linear regression, and building recommendation systems. • Email: sampatbigdata@gmail.com • LinkedIn:
  3. AGENDA • Around the Globe (Spark and Hadoop) • Big Data, the Big Data Stack, Apache Hadoop, Apache Spark • What is Hadoop and what is Spark? • Spark vs Hadoop and the combination effect • Q & A
  4. Around the Globe: NEWS: ---------- • Is it Spark 'vs' or Spark 'and' Hadoop? • Apache Spark is continuing to grow beyond Apache Hadoop. SURVEYS: -------------- • Big Data, the analysis of large quantities of data to gain new insight, has become a ubiquitous phrase in recent years. Day by day, data is growing at a staggering rate, and one of the most effective technologies for dealing with it is Hadoop. • For processing large data volumes, Hadoop uses the MapReduce programming model. http://www.ijetae.com/files/Volume4Issue5/IJETAE_0514_15.pdf • Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of 'mid-size' datasets and use cases that call for real-time processing. http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-require-1986162.htm
  5. Around the Globe (Cont.):
  6. Big Data, Big Data Stack, Apache Spark and Hadoop Big Data ---------- • Big data is a term that describes large volumes of data: structured, semi-structured and unstructured. • But it's not the amount of data that's important; it's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. • The concept gained momentum in the early 2000s when industry analysts articulated the now-mainstream definition of big data as the three Vs: Volume – Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden. Velocity – Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Variety – Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions. https://www.zettaset.com/index.php/info-center/what-is-big-data/
  7. Big Data, Big Data Stack, Apache Spark and Hadoop Big Data Stack -------------------
  8. Big Data, Big Data Stack, Apache Spark and Hadoop Apache Hadoop --------------------- • Hadoop is a framework designed to work with data sets that are orders of magnitude larger than normal systems can handle. • Hadoop distributes this data across a set of machines. The real power of Hadoop comes from its ability to scale to hundreds or thousands of computers, each containing several processor cores. • Many big enterprises believe that within a few years more than half of the world's data will be stored in Hadoop. • Hadoop mainly consists of: 1. Hadoop Distributed File System (HDFS): a distributed file system providing storage and fault tolerance. 2. Hadoop MapReduce: a powerful parallel programming model that processes vast quantities of data via distributed computing across the cluster.
  9. Big Data, Big Data Stack, Apache Spark and Hadoop Apache Spark --------------------- • Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads requiring fast, iterative access to datasets. • Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. • Spark is designed for data science, and its abstractions make data science easier. Data scientists commonly use machine learning, a set of techniques and algorithms that can learn from data. • Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  10. Spark vs Hadoop and the combination effect Performance ----------------- • Apache Spark processes data in-memory, while Hadoop MapReduce persists back to disk after a map or reduce action, so Spark should outperform Hadoop MapReduce. • Nonetheless, Spark needs a lot of memory. Much like standard DBs, it loads a process into memory and keeps it there until further notice, for the sake of caching. • If Spark runs on Hadoop YARN with other resource-demanding services, or if the data is too big to fit entirely into memory, then there could be major performance degradations for Spark. • MapReduce, however, kills its processes as soon as a job is done, so it can easily run alongside other services with minor performance differences. • Bottom line: Spark performs better when all the data fits in memory, especially on dedicated clusters; Hadoop MapReduce is designed for data that doesn't fit in memory, and it can run well alongside other services.
  11. Spark vs Hadoop and the combination effect Ease of Use: ----------------- • Spark has comfortable APIs for Java, Scala and Python, and also includes Spark SQL (formerly known as Shark) for the SQL-savvy. • Hadoop MapReduce is written in Java and is infamous for being very difficult to program. Pig makes it easier, though it requires some time to learn the syntax, and Hive adds SQL compatibility to the plate. • MapReduce doesn't have an interactive mode, although Hive includes a command-line interface. Projects like Impala, Presto and Tez aim to bring full interactive querying to Hadoop. • Bottom line: Spark is easier to program and includes an interactive mode; Hadoop MapReduce is more difficult to program, but many tools are available to make it easier.
  12. Spark vs Hadoop and the combination effect Cost: ----------------- • Both Spark and Hadoop MapReduce are open source, but money still needs to be spent on machines and staff. • Hardware requirements: the memory in a Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory. • Furthermore, there is a wide array of Hadoop-as-a-service and Hadoop-based offerings, which help to skip the hardware and staffing requirements. In comparison, there are few Spark-as-a-service options, and they are all very new. • Bottom line: Spark is more cost-effective according to the benchmarks, though staffing could be more costly; Hadoop MapReduce could be cheaper because more personnel are available and because of Hadoop-as-a-service offerings.
  13. Spark vs Hadoop and the combination effect Data Processing: ---------------------- • Apache Spark can do more than plain data processing: it can process graphs and use the existing machine-learning libraries. • Spark can do real-time processing as well as batch processing. • Hadoop MapReduce is great for batch processing. If you want a real-time option you'll need to use another platform like Storm or Impala, and for graph processing you can use Giraph. MapReduce used to have Apache Mahout for machine learning, but the elephant riders have ditched it in favor of Spark and H2O. • Bottom line: Spark is the key for real-time data processing; Hadoop MapReduce is the key for batch processing.
  14. Spark vs Hadoop and the combination effect Failure Tolerance: ---------------------- • Spark has retries per task and speculative execution, just like MapReduce. Nonetheless, because MapReduce relies on hard drives, if a process crashes in the middle of execution it can continue where it left off, whereas Spark has to start processing from the beginning. This can save MapReduce time when failures occur. • Bottom line: Spark and Hadoop MapReduce both have good failure tolerance, but Hadoop MapReduce is slightly more tolerant. Security: ------------------ • Spark is a bit bare at the moment when it comes to security. • Spark can run on YARN and use HDFS, which means that it can also enjoy Kerberos authentication, HDFS file permissions and encryption between nodes. • Hadoop MapReduce can enjoy all the Hadoop security benefits and integrate with Hadoop security projects, like Knox Gateway and Sentry. • Bottom line: Spark security is still in its infancy; Hadoop MapReduce has more security features and projects.
  15. Practical Demo on Performance and Ease of Using APIs
  16. Reference Links: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
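The MapReduce programming model described on slide 8 can be sketched in plain Python. This is illustrative only, not Hadoop code: the three functions mirror the map, shuffle/sort, and reduce phases that Hadoop runs at cluster scale.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts for one key, as a Hadoop reducer would.
    return (key, sum(values))

lines = ["spark and hadoop", "hadoop for batch"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'spark': 1, 'and': 1, 'hadoop': 2, 'for': 1, 'batch': 1}
```

In real Hadoop, the shuffle step also sorts and moves data across the network between mapper and reducer machines, which is where much of a job's cost goes.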
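Slide 9's description of Spark's in-memory engine rests on two ideas, chained transformations and explicit caching, which a toy Python class can illustrate. `ToyRDD` and all of its methods are invented stand-ins for this sketch, not the PySpark API (and its `cache()` computes eagerly, unlike Spark's lazy one):

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD: lazy transformations, explicit caching."""

    def __init__(self, compute):
        self._compute = compute  # thunk that produces the data when needed
        self._cache = None       # filled in by cache()

    def map(self, fn):
        # Record the transformation; nothing runs until collect().
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Keep the result in memory so later stages reuse it.
        self._cache = self._compute()
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._compute()

    def collect(self):
        # The "action" that finally triggers the chained computation.
        return self._materialize()

rdd = (ToyRDD(lambda: list(range(1, 6)))
       .cache()
       .map(lambda x: x * x)
       .filter(lambda x: x % 2 == 1))
print(rdd.collect())  # [1, 9, 25]
```

The point of the sketch: nothing in the `map`/`filter` chain runs until `collect()` is called, and a cached dataset can feed many such chains without being recomputed, which is what makes Spark fast for iterative workloads.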
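A toy illustration (not a benchmark) of the performance point on slide 10: MapReduce persists intermediate results to disk between jobs, while Spark keeps them in memory between stages. The file name and data here are arbitrary:

```python
import json
import os
import tempfile

data = list(range(100_000))

def two_jobs_via_disk(values):
    # MapReduce-style: job 1 persists its output to (HDFS-like) storage,
    # and job 2 reads it back, paying disk I/O between the two jobs.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "intermediate.json")
        with open(path, "w") as f:
            json.dump([v * 2 for v in values], f)
        with open(path) as f:
            return sum(json.load(f))

def two_stages_in_memory(values):
    # Spark-style: the intermediate result stays in RAM between stages.
    intermediate = [v * 2 for v in values]
    return sum(intermediate)

assert two_jobs_via_disk(data) == two_stages_in_memory(data) == 9_999_900_000
```

Both paths produce the same answer; the difference is the serialize/write/read round-trip in the first one, which is exactly the cost Spark avoids when the working set fits in memory.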
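The ease-of-use contrast on slide 11 can be hinted at even in plain Python: the same word count written in a chained, declarative style (roughly how Spark code reads) versus explicit imperative steps (closer to hand-written MapReduce). Both snippets are plain Python for illustration, not Spark or Hadoop APIs:

```python
from collections import Counter

lines = ["spark is fast", "hadoop is batch"]

# Chained, declarative style (what Spark's API tends to feel like):
chained = Counter(word for line in lines for word in line.split())

# Explicit, step-by-step style (closer to hand-written MapReduce code):
explicit = {}
for line in lines:
    for word in line.split():
        explicit[word] = explicit.get(word, 0) + 1

assert dict(chained) == explicit
print(explicit)  # {'spark': 1, 'is': 2, 'fast': 1, 'hadoop': 1, 'batch': 1}
```

The gap is much larger in practice: a MapReduce word count in Java needs separate mapper, reducer, and driver classes, while the PySpark equivalent is a few chained calls in an interactive shell.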
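Slide 13 notes that Spark handles real-time as well as batch processing; Spark Streaming does this by slicing the input stream into micro-batches and running the batch engine on each slice. A minimal count-based sketch of that idea (the `micro_batches` helper is invented for illustration; real Spark slices by time interval, not element count):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    # Slice an event stream into small batches, the way Spark Streaming's
    # micro-batch model slices input into short intervals.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = ["click", "view", "click", "buy", "view"]
per_batch_counts = [dict(Counter(b)) for b in micro_batches(events, 2)]
print(per_batch_counts)
# [{'click': 1, 'view': 1}, {'click': 1, 'buy': 1}, {'view': 1}]
```

Each micro-batch is then processed like a small batch job, which is why the same Spark code and libraries work for both streaming and batch workloads, whereas MapReduce needs a separate system (Storm, etc.) for the streaming side.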
