Big Data tools in practice
Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com
Agenda
• Hadoop
• Spark
• Python
Hadoop
• Pros
• Linear scalability.
• Commodity hardware.
• Pricing and licensing.
• Any data types.
• Analytical queries.
• Integration with traditional systems.
• Cons
• Complex implementation.
• MapReduce is hard to program.
• Inefficient for compute-intensive jobs on small data sets.
• No in-memory processing.
• Poorly suited to real-time analytics.
The Apache Hadoop software library is a framework that allows the
distributed processing of large data sets across clusters of computers using
simple programming models.
Hadoop
• Hadoop Common
• HDFS
• MapReduce (see the word-count sketch below)
• YARN
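A minimal word-count sketch of the MapReduce model in Python, runnable via Hadoop Streaming; script names, paths and the jar location are illustrative:

# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

# reducer.py: sum counts per word (Hadoop Streaming sorts input by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

# Submit with (paths illustrative):
# hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#   -input /data/in -output /data/out -file mapper.py -file reducer.py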
Hadoop HDFS
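A minimal PySpark sketch of reading from and writing to HDFS; the namenode address and paths are illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-example")
# Read a text file from HDFS; Spark splits it into partitions across the cluster
lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
# Write results back to HDFS as a directory of part files
lines.filter(lambda l: l).saveAsTextFile("hdfs://namenode:8020/data/output")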
Apache Spark
• Pros
• Up to 100x faster than MapReduce (in memory).
• Ease of use.
• Built-in Streaming, MLlib, GraphX and SQL.
• Pricing and licensing.
• In memory.
• Integration with Hadoop.
• Cons
• Integration with traditional systems.
• Limited memory per machine (GC pressure).
• Complex configuration.
Apache Spark is a fast and general engine for big data processing, with
built-in modules for streaming, SQL, machine learning and graph
processing.
Spark
Spark stack
Resilient Distributed Datasets
A distributed memory abstraction that allows programmers to perform in-memory computations
on large clusters while retaining the fault tolerance of data-flow models like MapReduce.*
• Immutability
• Lineage (reconstruct lost partitions)
• Fault tolerance through logging updates made to a dataset (single operation applied to
many records)
• Creation (sketched below):
• Reading a dataset from storage (HDFS or any other)
• From other RDDs
*Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
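Both creation routes in a minimal PySpark sketch; paths are illustrative and sc is an existing SparkContext:

# From storage (HDFS or any other supported file system)
logs = sc.textFile("hdfs:///data/app.log")
# From a local Python collection
nums = sc.parallelize([1, 2, 3, 4])
# From another RDD, via a transformation
errors = logs.filter(lambda line: "ERROR" in line)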
RDD operations
• Transformations
• Lazily evaluated (executed only when an action is called)
• Reduces wait states
• Better pipelining
• Actions
• Run immediately
• Return a value to the application or export data to a storage system
• Transformations: map(f : T ⇒ U), filter(f : T ⇒ Bool), groupByKey(), join()
• Actions: count(), collect(), reduce(f : (T, T) ⇒ T), save(path: String) (sketched below)
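A short PySpark sketch of the split: transformations only build a lazy plan, actions execute it (file name illustrative):

words = sc.textFile("words.txt").flatMap(lambda line: line.split())  # transformation: lazy
longs = words.filter(lambda w: len(w) > 5)  # transformation: still nothing has run
print(longs.count())    # action: triggers execution of the whole pipeline
print(longs.collect())  # action: returns the results to the driver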
Spark program lifecycle
Create RDD
(external data or parallelize collection)
Transformation
(lazy evaluated)
Cache RDD
(for reuse)
Action
(execute computation and return results)
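The four lifecycle steps in one minimal PySpark sketch (path illustrative):

rdd = sc.textFile("hdfs:///logs/app.log")             # 1. create RDD from external data
errs = rdd.filter(lambda l: "ERROR" in l)             # 2. transformation (lazy)
errs.cache()                                          # 3. cache for reuse
print(errs.count())                                   # 4. action: computes and caches
print(errs.filter(lambda l: "timeout" in l).count())  # reuses the cached RDD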
Spark in cluster mode
* http://spark.apache.org/docs/latest/img/cluster-overview.png
PySpark
• Python API for Spark
• Easy-to-use programming abstraction and parallel runtime:
• “Here’s an operation, run it on all of the data”
• Dynamically typed (RDDs can hold objects of multiple types)
• Integrate with other Python libraries, such as NumPy, Pandas, scikit-learn and Flask (see the sketch below)
• Run Spark from Jupyter notebooks
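A minimal sketch of PySpark holding ordinary Python objects, here NumPy arrays; assumes pyspark and numpy are installed:

from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="pyspark-demo")
# RDD elements are plain Python objects, so NumPy arrays work directly
vectors = sc.parallelize([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
print(vectors.reduce(lambda a, b: a + b))  # element-wise sum: [4. 6.]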
Spark DataFrames
DataFrames are a common data science abstraction that exists across languages.
A data frame is a table, or two-dimensional array-like structure, in which each column
contains measurements on one variable and each row contains one case.
A Spark DataFrame is a distributed collection of data organized into named columns, and
can be created:
• from structured data files
• from Hive tables
• from external databases
• from RDDs
Some supported operations (sketched below):
• slice data
• sort data
• aggregate data
• join with other DataFrames
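A sketch of those operations, assuming a DataFrame df with columns "id", "country" and "amount", and a second DataFrame other_df (all names illustrative):

df.limit(10).show()                                   # slice: take the first 10 rows
df.sort(df["amount"].desc()).show()                   # sort data
df.groupBy("country").count().show()                  # aggregate data
df.join(other_df, df["id"] == other_df["id"]).show()  # join with another DataFrame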
DataFrame benefits
• Lazy evaluation
• Domain-specific language for distributed data manipulation
• Automatic parallelization and cluster distribution
• Integration with the pipeline API for MLlib
• Query structured data with SQL (using SQLContext)
• Integration with Pandas DataFrames (and other Python data libraries)
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")  # create a DataFrame from a JSON file
df.show()                               # display the first rows
df.select("id").show()                  # project a single column
df.filter(df["id"] > 10).show()         # filter rows with a predicate
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")
df.registerTempTable("data")            # expose the DataFrame as a SQL table
results = sqlContext.sql("SELECT * FROM data WHERE id > 10")
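Tying into the Pandas integration above, a small result set can be pulled to the driver as a Pandas DataFrame; this collects all rows locally, so it is only safe when the result fits in memory:

pdf = results.toPandas()  # collect all rows to the driver as a Pandas DataFrame
print(pdf.head())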
Pandas DF vs Spark DF
• Pandas: single-machine tool (all data must fit in memory, except with HDF5); Spark: distributed (data > memory)
• Pandas: richer API; Spark: good API
• Pandas: no parallelism; Spark: parallel by default
• Pandas: mutable; Spark: immutable
• Some functions differ: reading data, counting, displaying, inferring types, statistics, creating new columns
(https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
A very popular benchmark
* https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png
Big Data tools in practice
Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com
