Big Data tools in practice
Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com
Agenda
• Hadoop
• Spark
• Python
Hadoop
• Pros
• Linear scalability.
• Commodity hardware.
• Pricing and licensing.
• Any data types.
• Analytical queries.
• Integration with traditional systems.
• Cons
• Complex implementation.
• MapReduce is hard to program.
• Inefficient for compute-intensive jobs on small data sets.
• No in-memory processing.
• Poorly suited to real-time analytics.
The Apache Hadoop software library is a framework that allows the
distributed processing of large data sets across clusters of computers using
simple programming models.
Hadoop
• Hadoop Common
• HDFS
• MapReduce (see the word-count sketch below)
• YARN
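A minimal word-count sketch of the MapReduce model in Python, runnable via Hadoop Streaming; script names, paths and the jar location are illustrative:

# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

# reducer.py: sum counts per word (Hadoop Streaming sorts input by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

# Submit with (paths illustrative):
# hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#   -input /data/in -output /data/out -file mapper.py -file reducer.py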
Hadoop HDFS
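A minimal PySpark sketch of reading from and writing to HDFS; the namenode address and paths are illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="hdfs-example")
# Read a text file from HDFS; Spark splits it into partitions across the cluster
lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
# Write results back to HDFS as a directory of part files
lines.filter(lambda l: l).saveAsTextFile("hdfs://namenode:8020/data/output")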
Apache Spark
• Pros
• Up to 100x faster than MapReduce (in memory).
• Ease of use.
• Built-in Streaming, MLlib, GraphX and SQL.
• Pricing and licensing.
• In memory.
• Integration with Hadoop.
• Cons
• Integration with traditional systems.
• Limited memory per machine (GC pressure).
• Complex configuration.
Apache Spark is a fast and general engine for big data processing, with
built-in modules for streaming, SQL, machine learning and graph
processing.
Spark
Spark stack
Resilient Distributed Datasets
A distributed memory abstraction that allows programmers to perform in-memory computations
on large clusters while retaining the fault tolerance of data-flow models like MapReduce.*
• Immutability
• Lineage (reconstruct lost partitions)
• Fault tolerance through logging updates made to a dataset (single operation applied to
many records)
• Creation (sketched below):
• Reading a dataset from storage (HDFS or any other)
• From other RDDs
*Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
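Both creation routes in a minimal PySpark sketch; paths are illustrative and sc is an existing SparkContext:

# From storage (HDFS or any other supported file system)
logs = sc.textFile("hdfs:///data/app.log")
# From a local Python collection
nums = sc.parallelize([1, 2, 3, 4])
# From another RDD, via a transformation
errors = logs.filter(lambda line: "ERROR" in line)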
RDD operations
• Transformations
• Lazily evaluated (executed only when an action is called)
• Reduces wait states
• Better pipelining
• Actions
• Run immediately
• Return a value to the application or export data to a storage system
• Transformations: map(f : T ⇒ U), filter(f : T ⇒ Bool), groupByKey(), join()
• Actions: count(), collect(), reduce(f : (T, T) ⇒ T), save(path: String) (sketched below)
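A short PySpark sketch of the split: transformations only build a lazy plan, actions execute it (file name illustrative):

words = sc.textFile("words.txt").flatMap(lambda line: line.split())  # transformation: lazy
longs = words.filter(lambda w: len(w) > 5)  # transformation: still nothing has run
print(longs.count())    # action: triggers execution of the whole pipeline
print(longs.collect())  # action: returns the results to the driver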
Spark program lifecycle
Create RDD
(external data or parallelize collection)
Transformation
(lazy evaluated)
Cache RDD
(for reuse)
Action
(execute computation and return results)
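The four lifecycle steps in one minimal PySpark sketch (path illustrative):

rdd = sc.textFile("hdfs:///logs/app.log")             # 1. create RDD from external data
errs = rdd.filter(lambda l: "ERROR" in l)             # 2. transformation (lazy)
errs.cache()                                          # 3. cache for reuse
print(errs.count())                                   # 4. action: computes and caches
print(errs.filter(lambda l: "timeout" in l).count())  # reuses the cached RDD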
Spark in cluster mode
* http://spark.apache.org/docs/latest/img/cluster-overview.png
PySpark
• Python API for Spark
• Easy-to-use programming abstraction and parallel runtime:
• “Here’s an operation, run it on all of the data”
• Dynamically typed (RDDs can hold objects of multiple types)
• Integrate with other Python libraries, such as NumPy, Pandas, scikit-learn and Flask (see the sketch below)
• Run Spark from Jupyter notebooks
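A minimal sketch of PySpark holding ordinary Python objects, here NumPy arrays; assumes pyspark and numpy are installed:

from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="pyspark-demo")
# RDD elements are plain Python objects, so NumPy arrays work directly
vectors = sc.parallelize([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
print(vectors.reduce(lambda a, b: a + b))  # element-wise sum: [4. 6.]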
Spark DataFrames
DataFrames are a common data science abstraction that exists across languages.
A data frame is a table, or two-dimensional array-like structure, in which each column
contains measurements on one variable and each row contains one case.
A Spark DataFrame is a distributed collection of data organized into named columns, and
can be created:
• from structured data files
• from Hive tables
• from external databases
• from RDDs
Some supported operations (sketched below):
• slice data
• sort data
• aggregate data
• join with other DataFrames
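A sketch of those operations, assuming a DataFrame df with columns "id", "country" and "amount", and a second DataFrame other_df (all names illustrative):

df.limit(10).show()                                   # slice: take the first 10 rows
df.sort(df["amount"].desc()).show()                   # sort data
df.groupBy("country").count().show()                  # aggregate data
df.join(other_df, df["id"] == other_df["id"]).show()  # join with another DataFrame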
DataFrame benefits
• Lazy evaluation
• Domain-specific language for distributed data manipulation
• Automatic parallelization and cluster distribution
• Integration with the pipeline API for MLlib
• Query structured data with SQL (using SQLContext)
• Integration with Pandas DataFrames (and other Python data libraries)
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")  # create a DataFrame from a JSON file
df.show()                               # display the first rows
df.select("id").show()                  # project a single column
df.filter(df["id"] > 10).show()         # filter rows with a predicate
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.json("data.json")
df.registerTempTable("data")            # expose the DataFrame as a SQL table
results = sqlContext.sql("SELECT * FROM data WHERE id > 10")
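Tying into the Pandas integration above, a small result set can be pulled to the driver as a Pandas DataFrame; this collects all rows locally, so it is only safe when the result fits in memory:

pdf = results.toPandas()  # collect all rows to the driver as a Pandas DataFrame
print(pdf.head())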
Pandas DF vs Spark DF
• Pandas: single-machine tool (all data must fit in memory, except with HDF5); Spark: distributed (data > memory)
• Pandas: richer API; Spark: good API
• Pandas: no parallelism; Spark: parallel by default
• Pandas: mutable; Spark: immutable
• Some functions differ: reading data, counting, displaying, inferring types, statistics, creating new columns
(https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
A very popular benchmark
* https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png
Big Data tools in practice
Darko Marjanović, darko@thingsolver.com
Miloš Milovanović, milos@thingsolver.com
