
Big Data tools in practice


  1. Big Data tools in practice. Darko Marjanović, darko@thingsolver.com; Miloš Milovanović, milos@thingsolver.com
  2. Agenda • Hadoop • Spark • Python
  3. Hadoop • Pros • Linear scalability. • Commodity hardware. • Pricing and licensing. • Any data types. • Analytical queries. • Integration with traditional systems. • Cons • Implementation effort. • MapReduce ease of use. • Intense calculations on little data. • No in-memory processing. • No real-time analytics. The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
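  To make the MapReduce model concrete, here is a minimal word-count sketch for Hadoop Streaming (illustrative only; the script names, the tab-separated output format and the use of the streaming jar are assumptions, not taken from the slides). The mapper emits a (word, 1) pair per token; Hadoop sorts the mapper output by key, and the reducer sums the counts for each word.

     # mapper.py -- reads raw text from stdin and emits "word<TAB>1" per token
     import sys
     for line in sys.stdin:
         for word in line.strip().split():
             print("%s\t%d" % (word, 1))

     # reducer.py -- input arrives sorted by key, so equal words are adjacent
     import sys
     current_word, current_count = None, 0
     for line in sys.stdin:
         word, count = line.rstrip("\n").split("\t", 1)
         if word == current_word:
             current_count += int(count)
         else:
             if current_word is not None:
                 print("%s\t%d" % (current_word, current_count))
             current_word, current_count = word, int(count)
     if current_word is not None:
         print("%s\t%d" % (current_word, current_count))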
  4. Hadoop • Hadoop Common • HDFS • MapReduce • YARN
  5. Hadoop HDFS
  6. Hadoop HDFS
  7. Apache Spark • Pros • Up to 100x faster than MapReduce. • Ease of use. • Streaming, MLlib, GraphX and SQL. • Pricing and licensing. • In-memory processing. • Integration with Hadoop. • Cons • Integration with traditional systems. • Limited memory per machine (GC). • Configuration. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
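  For context, a minimal PySpark entry point looks roughly like the sketch below (the application name and the local[*] master are placeholders; on a real cluster the master would be YARN or a standalone master URL):

     from pyspark import SparkConf, SparkContext

     # "local[*]" runs Spark locally on all cores; replace with a cluster URL in practice
     conf = SparkConf().setAppName("big-data-tools-demo").setMaster("local[*]")
     sc = SparkContext(conf=conf)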
  8. Spark
  9. Spark stack
  10. Resilient Distributed Datasets A distributed memory abstraction that lets programmers perform in-memory computations on large clusters while retaining the fault tolerance of a data-flow model like MapReduce.* • Immutability • Lineage (reconstruct lost partitions) • Fault tolerance through logging the updates made to a dataset (a single operation applied to many records) • Creation: • Reading a dataset from storage (HDFS or any other) • From other RDDs *Technical Report No. UCB/EECS-2011-82, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.html
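  A short sketch of the creation paths listed above (the HDFS path is hypothetical; sc is the SparkContext):

     # From storage (HDFS, S3, local files, ...)
     lines = sc.textFile("hdfs:///data/events.log")

     # From an in-memory Python collection
     numbers = sc.parallelize(range(1000))

     # From another RDD, via a transformation (lineage records how to rebuild lost partitions)
     errors = lines.filter(lambda l: "ERROR" in l)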
  11. RDD operations • Transformations • Lazily evaluated (executed by calling an action) • Reduces wait states • Better pipelining • Actions • Run immediately • Return a value to the application or export data to a storage system • map(f: T ⇒ U) • filter(f: T ⇒ Bool) • groupByKey() • join() • count() • collect() • reduce(f: (T, T) ⇒ T) • save(path: String)
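  For example (a sketch that builds on the lines RDD above; the output path is hypothetical):

     # Transformations only record lineage; nothing runs yet
     pairs = lines.map(lambda l: (l.split(" ")[0], 1))
     counts = pairs.reduceByKey(lambda a, b: a + b)

     # Actions trigger execution of the whole pipeline
     n = counts.count()                            # returns a number to the driver
     results = counts.collect()                    # returns all (key, count) pairs to the driver
     counts.saveAsTextFile("hdfs:///out/counts")   # exports to a storage system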
  12. Spark program lifecycle: Create RDD (from external data or by parallelizing a collection) → Transformation (lazily evaluated) → Cache RDD (for reuse) → Action (execute the computation and return results)
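  The same lifecycle in code (a minimal sketch; the path and the parsing logic are assumptions):

     raw = sc.textFile("hdfs:///data/events.log")          # 1. create RDD from external data
     parsed = raw.map(lambda l: l.split(","))              # 2. transformation (lazily evaluated)
     parsed.cache()                                        # 3. mark for in-memory reuse
     total = parsed.count()                                # 4. action: computes and populates the cache
     wide = parsed.filter(lambda f: len(f) > 10).count()   # later actions reuse the cached partitions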
  13. Spark in cluster mode * http://spark.apache.org/docs/latest/img/cluster-overview.png
  14. PySpark • Python API for Spark • Easy-to-use programming abstraction and parallel runtime: • “Here’s an operation, run it on all of the data” • Dynamically typed (RDDs can hold objects of multiple types) • Integrates with other Python libraries, such as NumPy, Pandas, scikit-learn and Flask • Run Spark from Jupyter notebooks
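  A small sketch of that integration, assuming NumPy is installed on the driver and on every executor:

     import numpy as np

     # distribute 100 random vectors and compute their norms in parallel
     vectors = sc.parallelize([np.random.rand(3) for _ in range(100)])
     norms = vectors.map(lambda v: float(np.linalg.norm(v)))
     print(norms.take(5))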
  15. Spark DataFrames DataFrames are a common data science abstraction that goes across languages. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case. A Spark DataFrame is a distributed collection of data organized into named columns, and can be created: • from structured data files • from Hive tables • from external databases • from existing RDDs Some supported operations: • slice data • sort data • aggregate data • join with other DataFrames
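  For instance, a DataFrame can be built from an RDD of Row objects (a sketch; the column names and values are made up):

     from pyspark.sql import SQLContext, Row

     sqlContext = SQLContext(sc)
     people = sc.parallelize([Row(id=1, name="Ana"), Row(id=2, name="Marko")])
     df = sqlContext.createDataFrame(people)
     df.printSchema()
     df.show()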
  16. DataFrame benefits • Lazy evaluation • Domain-specific language for distributed data manipulation • Automatic parallelization and cluster distribution • Integration with the pipeline API for MLlib • Query structured data with SQL (using SQLContext) • Integration with Pandas DataFrames (and other Python data libraries)
     # DataFrame API
     from pyspark.sql import SQLContext
     sqlContext = SQLContext(sc)
     df = sqlContext.read.json("data.json")
     df.show()
     df.select("id").show()
     df.filter(df["id"] > 10).show()
     # SQL over the same data
     df.registerTempTable("data")
     results = sqlContext.sql("SELECT * FROM data WHERE id > 10")
  17. Pandas DF vs Spark DF • Pandas: single-machine tool (all data needs to fit in memory, except with HDF5); Spark: distributed (data > memory) • Pandas: better API; Spark: good API • Pandas: no parallelism; Spark: parallel by default • Pandas: mutable; Spark: immutable • Some function differences: reading data, counting, displaying, inferring types, statistics, creating new columns (https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2)
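  Converting between the two is straightforward when the result fits on a single machine (a sketch; assumes Pandas is installed and that the Spark DataFrame is small enough to collect to the driver):

     pdf = df.toPandas()                       # Spark DataFrame -> local Pandas DataFrame
     df2 = sqlContext.createDataFrame(pdf)     # Pandas DataFrame -> distributed Spark DataFrame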
  18. A very popular benchmark * https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM-1024x457.png
  19. Big Data tools in practice. Darko Marjanović, darko@thingsolver.com; Miloš Milovanović, milos@thingsolver.com
