This document discusses using Apache Spark to analyze web log data. Spark is well suited to this task because of its high-level API and its strong performance when the working set fits within the cluster's total memory. The document covers parsing log lines, implementing a lambda architecture with Spark, configuring a Spark cluster with Linux containers, and techniques for managing Spark's memory usage, such as caching frequently reused RDDs. It also provides aggregation examples using groupBy and reduceByKey.
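As a rough sketch of the parsing and per-key aggregation steps mentioned above, the following plain-Python example parses lines in the Common Log Format (the log layout and field names here are assumptions, since the document does not specify them) and sums bytes per host. The aggregation mirrors the shape of Spark's `map(...).reduceByKey(...)` pattern without requiring a Spark runtime:

```python
import re
from collections import defaultdict

# Assumed Common Log Format, e.g.:
# 127.0.0.1 - - [01/Aug/2014:00:00:01 -0400] "GET /index.html HTTP/1.1" 200 1024
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+|-)$'
)

def parse_line(line):
    """Parse one log line into a dict, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    host, ts, method, path, status, size = m.groups()
    return {
        "host": host,
        "timestamp": ts,
        "method": method,
        "path": path,
        "status": int(status),
        # A "-" size means no body was sent; treat it as zero bytes.
        "bytes": 0 if size == "-" else int(size),
    }

def bytes_per_host(lines):
    """Sum response bytes per host.

    This is the same shape as Spark's
    rdd.map(lambda r: (r["host"], r["bytes"])).reduceByKey(lambda a, b: a + b),
    expressed with a plain dict for illustration.
    """
    totals = defaultdict(int)
    for rec in filter(None, map(parse_line, lines)):
        totals[rec["host"]] += rec["bytes"]
    return dict(totals)
```

In Spark itself, the equivalent pipeline would look roughly like `sc.textFile(path).map(parse_line).filter(lambda r: r is not None).map(lambda r: (r["host"], r["bytes"])).reduceByKey(lambda a, b: a + b)`, and calling `.cache()` on the parsed RDD before running several aggregations avoids re-parsing the raw text each time, per the caching technique the document describes.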