SlideShare a Scribd company logo
،‫مفاهیم‬ ‫داده؛‬ ‫کالن‬
‫ها‬‫حل‬ ‫راه‬ ‫و‬ ‫ها‬‫چالش‬
VAHID AMIRI
VAHIDAMIRY.IR
VAHID.AMIRY@GMAIL.COM
@DATASTACK
‫های‬ ‫داده‬ ‫پردازش‬ ‫و‬ ‫توزیعی‬ ‫محاسبات‬ ‫ملی‬ ‫کنفرانس‬ ‫سومین‬‫بزرگ‬
‫تبریز‬
‫اردیبهشت‬96
Big DataData Data Processing
Data Gathering
Data Storing
Big Data Definition
 No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
Big Data: 3V’s
Some Make it 4V’s
Solution
Big
Data
Big
Comput
ation
Big
Computer
Big Data Solutions
 Hadoop is a software framework for distributed processing of large datasets
across large clusters of computers
 Hadoop implements Google’s MapReduce, using HDFS
 MapReduce divides applications into many small blocks of work.
 HDFS creates multiple replicas of data blocks for reliability, placing them on compute
nodes around the cluster
Hadoop
Spark Stack
 More than just the Elephant in the room
 Over 120+ types of NoSQL databases
So many NoSQL options
 Extend the Scope of RDBMS
 Caching
 Master/Slave
 Table Partitioning
 Federated Tables
 Sharding
NoSql
 Relational database (RDBMS) technology
 Has not fundamentally changed in over 40 years
 Default choice for holding data behind many web apps
 Handling more users means adding a bigger server
RDBMS with Extended Functionality
Vs.
Systems Built from Scratch
with Scalability in Mind
NoSQL Movement
CAP Theorem
 “Of three properties of shared-data systems – data Consistency, system
Availability and tolerance to network Partition – only two can be achieved at
any given moment in time.”
“Of three properties of shared-data systems – data
Consistency, system Availability and tolerance to
network Partition – only two can be achieved at any
given moment in time.”
 CA
 Highly-available consistency
 CP
 Enforced consistency
 AP
 Eventual consistency
CAP Theorem
Flavors of NoSQL
 Schema-less
 State (Persistent or Volatile)
 Example:
 Redis
 Amazon DynamoDB
Key / Value Database
 Wide, sparse column sets
 Schema-light
 Examples:
 Cassandra
 HBase
 BigTable
 GAE HR DS
Column Database
 Use for data that is
 document-oriented (collection of JSON documents) w/semi structured
data
 Encodings include XML, YAML, JSON & BSON
 binary forms
 PDF, Microsoft Office documents -- Word, Excel…)
 Examples: MongoDB, CouchDB
Document Database
Graph Database
Use for data with
 a lot of many-to-many relationships
 when your primary objective is quickly
finding connections, patterns and
relationships between the objects within
lots of data
 Examples: Neo4J, FreeBase (Google)
So which type of NoSQL? Back to CAP…
CP = noSQL/column
Hadoop
Big Table
HBase
MemCacheDB
AP = noSQL/document or key/value
DynamoDB
CouchDB
Cassandra
Voldemort
CA = SQL/RDBMS
SQL Sever / SQL
Azure
Oracle
MySQL
About Apache Spark
 Fast and general purpose cluster computing system
 10x (on disk) - 100x (In-Memory) faster
 Most popular for running Iterative Machine Learning Algorithms.
 Provides high level APIs in
 Java
 Scala
 Python
 R
 http://spark.apache.org/
Why Spark ?
 Most of Machine Learning Algorithms are iterative because each iteration
can improve the results
 With Disk based approach each iteration’s output is written to disk making it
slow
Hadoop execution flow
Spark execution flow
Spark history
A Brief History: MapReduce
 MapReduce use cases showed two major limitations:
 difficultly of programming directly in MR
 performance bottlenecks, or batch not fitting the use cases
A Brief History: MapReduce
A Brief History: Spark
 Some key points about Spark:
 handles batch, interactive, and real-time within a single framework
 native integration with Java, Python, Scala
 programming at a higher level of abstraction
 more general: map/reduce is just one set of supported constructs
Spark Stack
 Spark SQL
 For SQL and unstructured data processing
 MLib
 Machine Learning Algorithms
 GraphX
 Graph Processing
 Spark Streaming
 stream processing of live data streams
Cluster Deployment
 Standalone Deploy Mode
 simplest way to deploy Spark on a private cluster
 Amazon EC2
 EC2 scripts are available
 Very quick launching a new cluster
 Apache Mesos
 Hadoop YARN
Which Language Should I Use?
 Standalone programs can be written in any, but console is only Python & Scala
 Python developers: can stay with Python for both
 Java developers: consider using Scala for console (to learn the API)
 Performance: Java / Scala will be faster (statically typed), but Python can do well
for numerical work with NumPy
RDD
 Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a
fault-tolerant collection of elements that can be operated on in parallel
RDD
 two types of operations on RDDs:
 transformations and actions
 transformations are lazy
 (not computed immediately)
 the transformed RDD gets recomputed
 when an action is run on it (default)
 however, an RDD can be persisted into
 storage in memory or disk
Transformations
 Transformations create a new dataset from an existing one
 All transformations in Spark are lazy: they do not compute their results right away
– instead they remember the transformations applied to some base dataset
 optimize the required calculations
 recover from lost data partitions
Transformations
Transformations
Actions
Actions
Execution Flow
Standalone (Scala)
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your
system
val conf = new SparkConf().setAppName("Simple Application")
.setMaster(“local")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
Standalone (Java)
public class SimpleApp {
public static void main(String[] args) {
String logFile = “LOGFILES_ADDRESS";
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logData = sc.textFile(logFile);
long numAs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("a"); }
}).count();
long numBs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("b"); }
}).count();
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
}
}
Big data vahidamiri-tabriz-13960226-datastack.ir

More Related Content

What's hot

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
Harikrishnan K
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
ShivanandaVSeeri
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
InSemble
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Dzung Nguyen
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
Mohanasundaram Ponnusamy
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
Asis Mohanty
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
Thanh Nguyen
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
Andrew Brust
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
sudhakara st
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
MaulikLakhani
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
Danairat Thanabodithammachari
 

What's hot (20)

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 

Similar to Big data vahidamiri-tabriz-13960226-datastack.ir

In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
Ike Ellis
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Elvis Saravia
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
Richard Kuo
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
Atif Akhtar
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
Animesh Chaturvedi
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
Infinity Tech Solutions
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
IJCSIS Research Publications
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
NouhaElhaji1
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 

Similar to Big data vahidamiri-tabriz-13960226-datastack.ir (20)

In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 

Recently uploaded

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 

Recently uploaded (20)

Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 

Big data vahidamiri-tabriz-13960226-datastack.ir

  • 1. ،‫مفاهیم‬ ‫داده؛‬ ‫کالن‬ ‫ها‬‫حل‬ ‫راه‬ ‫و‬ ‫ها‬‫چالش‬ VAHID AMIRI VAHIDAMIRY.IR VAHID.AMIRY@GMAIL.COM @DATASTACK ‫های‬ ‫داده‬ ‫پردازش‬ ‫و‬ ‫توزیعی‬ ‫محاسبات‬ ‫ملی‬ ‫کنفرانس‬ ‫سومین‬‫بزرگ‬ ‫تبریز‬ ‫اردیبهشت‬96
  • 2. Big DataData Data Processing Data Gathering Data Storing
  • 3.
  • 4. Big Data Definition  No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
  • 6. Some Make it 4V’s
  • 9.  Hadoop is a software framework for distributed processing of large datasets across large clusters of computers  Hadoop implements Google’s MapReduce, using HDFS  MapReduce divides applications into many small blocks of work.  HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster Hadoop
  • 10.
  • 12.  More than just the Elephant in the room  Over 120+ types of NoSQL databases So many NoSQL options
  • 13.  Extend the Scope of RDBMS  Caching  Master/Slave  Table Partitioning  Federated Tables  Sharding NoSql  Relational database (RDBMS) technology  Has not fundamentally changed in over 40 years  Default choice for holding data behind many web apps  Handling more users means adding a bigger server
  • 14. RDBMS with Extended Functionality Vs. Systems Built from Scratch with Scalability in Mind NoSQL Movement
  • 15. CAP Theorem  “Of three properties of shared-data systems – data Consistency, system Availability and tolerance to network Partition – only two can be achieved at any given moment in time.”
  • 16. “Of three properties of shared-data systems – data Consistency, system Availability and tolerance to network Partition – only two can be achieved at any given moment in time.”  CA  Highly-available consistency  CP  Enforced consistency  AP  Eventual consistency CAP Theorem
  • 18.  Schema-less  State (Persistent or Volatile)  Example:  Redis  Amazon DynamoDB Key / Value Database
  • 19.  Wide, sparse column sets  Schema-light  Examples:  Cassandra  HBase  BigTable  GAE HR DS Column Database
  • 20.  Use for data that is  document-oriented (collection of JSON documents) w/semi structured data  Encodings include XML, YAML, JSON & BSON  binary forms  PDF, Microsoft Office documents -- Word, Excel…)  Examples: MongoDB, CouchDB Document Database
  • 21. Graph Database Use for data with  a lot of many-to-many relationships  when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data  Examples: Neo4J, FreeBase (Google)
  • 22. So which type of NoSQL? Back to CAP… CP = noSQL/column Hadoop Big Table HBase MemCacheDB AP = noSQL/document or key/value DynamoDB CouchDB Cassandra Voldemort CA = SQL/RDBMS SQL Sever / SQL Azure Oracle MySQL
  • 23.
  • 24.
  • 25. About Apache Spark  Fast and general purpose cluster computing system  10x (on disk) - 100x (In-Memory) faster  Most popular for running Iterative Machine Learning Algorithms.  Provides high level APIs in  Java  Scala  Python  R  http://spark.apache.org/
  • 26. Why Spark ?  Most of Machine Learning Algorithms are iterative because each iteration can improve the results  With Disk based approach each iteration’s output is written to disk making it slow Hadoop execution flow Spark execution flow
  • 28. A Brief History: MapReduce  MapReduce use cases showed two major limitations:  difficultly of programming directly in MR  performance bottlenecks, or batch not fitting the use cases
  • 29. A Brief History: MapReduce
  • 30. A Brief History: Spark  Some key points about Spark:  handles batch, interactive, and real-time within a single framework  native integration with Java, Python, Scala  programming at a higher level of abstraction  more general: map/reduce is just one set of supported constructs
  • 31. Spark Stack  Spark SQL  For SQL and unstructured data processing  MLib  Machine Learning Algorithms  GraphX  Graph Processing  Spark Streaming  stream processing of live data streams
  • 32. Cluster Deployment  Standalone Deploy Mode  simplest way to deploy Spark on a private cluster  Amazon EC2  EC2 scripts are available  Very quick launching a new cluster  Apache Mesos  Hadoop YARN
  • 33. Which Language Should I Use?  Standalone programs can be written in any, but console is only Python & Scala  Python developers: can stay with Python for both  Java developers: consider using Scala for console (to learn the API)  Performance: Java / Scala will be faster (statically typed), but Python can do well for numerical work with NumPy
  • 34. RDD  Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel
  • 35. RDD  two types of operations on RDDs:  transformations and actions  transformations are lazy  (not computed immediately)  the transformed RDD gets recomputed  when an action is run on it (default)  however, an RDD can be persisted into  storage in memory or disk
  • 36. Transformations  Transformations create a new dataset from an existing one  All transformations in Spark are lazy: they do not compute their results right away – instead they remember the transformations applied to some base dataset  optimize the required calculations  recover from lost data partitions
  • 42. Standalone (Scala) /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system val conf = new SparkConf().setAppName("Simple Application") .setMaster(“local") val sc = new SparkContext(conf) val logData = sc.textFile(logFile, 2).cache() val numAs = logData.filter(line => line.contains("a")).count() val numBs = logData.filter(line => line.contains("b")).count() println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) } }
  • 43. Standalone (Java) public class SimpleApp { public static void main(String[] args) { String logFile = “LOGFILES_ADDRESS"; SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); JavaRDD<String> logData = sc.textFile(logFile); long numAs = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("a"); } }).count(); long numBs = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("b"); } }).count(); System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); } }