Petabyte-Scale Text Processing with Spark

At Grammarly, we have long used Amazon EMR with Hadoop and Pig to support our big data processing needs. However, we were excited about the improvements the maturing Apache Spark offers over Hadoop and Pig, so we set about getting Spark to work with our petabyte-scale text data set. This presentation describes the challenges we encountered along the way and the scalable, working Spark setup we arrived at as a result.


  1. Petabyte-Scale Text Processing with Spark. Oleksii Sliusarenko, Grammarly Inc. E-mail: aliaxey90 (at) gmail (dot) com. Read the full article in the Grammarly tech blog.
  2. Modern error correction: "depending from the weather" → "depending on the weather".
  3. Common Crawl = internet dump. Size: 3 petabytes. Format: WARC, a raw HTTP protocol dump. We need 1 PB of storage, i.e., 2000 x 480 GB SSD disks (2000 x 480 GB = 960 TB ≈ 1 PB).
  4. High-level pipeline view: extract texts → English filter → deduplicate → break into words → count frequencies.
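A minimal sketch of that pipeline as a Spark job (Scala). The text-extraction and language-filter helpers, app name, and bucket paths are illustrative placeholders, not the code from the talk:

import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  // Placeholder stand-ins: real WARC text extraction and English
  // detection are far more involved than this.
  def extractText(warcRecord: String): String = warcRecord
  def isEnglish(text: String): Boolean = true

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("common-crawl-pipeline"))
    sc.textFile("s3a://my-bucket/warc-input")   // hypothetical input path
      .map(extractText)                         // extract texts
      .filter(isEnglish)                        // English filter
      .distinct()                               // deduplicate
      .flatMap(_.split("\\s+"))                 // break into words
      .map(word => (word, 1L))                  // count frequencies...
      .reduceByKey(_ + _)                       // ...across the whole corpus
      .saveAsTextFile("s3a://my-bucket/word-counts")
    sc.stop()
  }
}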
  5. Typical processing step example. Processing example: count each n-gram frequency.
     Input data example: <sentence> <tab> <frequency>
       My name is Bob.      12
       Kiev is a capital.   25
     Output data example: <n-gram> <tab> <frequency>
       name is   12
       is        37
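A hedged sketch of this step in Spark (Scala), following the tab-separated format above; maxN, the paths, and the app name are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object NGramCounts {
  // All n-grams of length 1..maxN in a tokenized sentence.
  def ngrams(tokens: Array[String], maxN: Int): Seq[String] =
    for {
      n <- 1 to maxN
      window <- tokens.sliding(n).filter(_.length == n).toSeq
    } yield window.mkString(" ")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ngram-counts"))
    sc.textFile("s3a://my-bucket/sentence-counts")     // hypothetical input
      .map(_.split("\t"))
      .collect { case Array(sentence, freq) =>         // silently drops malformed lines
        (sentence.split("\\s+"), freq.toLong)
      }
      .flatMap { case (tokens, freq) =>
        // each n-gram inherits the frequency of its source sentence
        ngrams(tokens, maxN = 3).map(g => (g, freq))
      }
      .reduceByKey(_ + _)
      .map { case (gram, count) => s"$gram\t$count" }
      .saveAsTextFile("s3a://my-bucket/ngram-counts")  // hypothetical output
    sc.stop()
  }
}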
  6. Classic and modern approaches.
  7. Our alternatives: $12000, $3000, $1000.
  8. Default choice: Amazon EMR. $12000, $24000, OOM, segfault.
  9. Our MapReduce: 12x faster than Hadoop, easy to learn, full support, 2x2=4.
  10. Our MapReduce, difficulties: hardware failures, network failures, distributed failsafe operation.
  11. Fixing Spark: 3 months!
  12. First of all: latest stable Spark, latest stable Hadoop. ◈ Build Spark with patch ◈ Don’t forget Hadoop native libraries
  13. The hardest button: S3 HEAD request failed for "file path" - ResponseCode=403, ResponseMessage=Forbidden. Why???
  14. HTTP HEAD request: the error description lives in the HTTP body, but a HEAD response has no body, so it is never fetched!
  15. Possible reasons: ◈ AccessDenied ◈ AccountProblem ◈ CrossLocationLoggingProhibited ◈ InvalidAccessKeyId ◈ InvalidObjectState ◈ InvalidPayer ◈ InvalidSecurity ◈ NotSignedUp ◈ RequestTimeTooSkewed ◈ SignatureDoesNotMatch
  16. We need to go deeper! Spark → Hadoop → JetS3t → HttpClient. Fix here.
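Since a HEAD response carries no body, one way to surface the real S3 error is to replay the failing request as a GET and read the XML error document. A minimal illustration in Scala, assuming an unsigned request to a hypothetical public object (real S3 requests must be signed, and the deck's actual fix lives somewhere down the stack above):

import java.net.{HttpURLConnection, URL}
import scala.io.Source

object S3ErrorBody {
  def main(args: Array[String]): Unit = {
    val url = new URL("https://my-bucket.s3.amazonaws.com/file-path") // hypothetical object
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET") // a HEAD gets the same status code but no body
    val code = conn.getResponseCode
    if (code >= 400) {
      // On a GET, S3 returns an XML error document naming the real reason:
      // AccessDenied, RequestTimeTooSkewed, SignatureDoesNotMatch, ...
      val body = Option(conn.getErrorStream)
        .map(in => Source.fromInputStream(in).mkString)
        .getOrElse("<no body>")
      println(s"HTTP $code\n$body")
    }
    conn.disconnect()
  }
}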
  17. Fixing Spark, fixing S3: ◈ Choose latest filesystem: S3A, not S3 or S3N ◈ conf.setInt("fs.s3a.connection.maximum", 100) ◈ Use DirectOutputCommitter ◈ --conf spark.hadoop.fs.s3a.access.key=…
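The same settings wired through SparkConf, as a sketch; the secret-key line and the committer class name are assumptions added for completeness (the spark.hadoop.* prefix forwards options to the Hadoop Configuration):

import org.apache.spark.SparkConf

object S3Tuning {
  def tunedConf(): SparkConf = new SparkConf()
    .setAppName("s3a-tuning")
    // S3A, not the older s3:// or s3n:// filesystems
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    // raise the connection pool limit, equivalent to the slide's conf.setInt
    .set("spark.hadoop.fs.s3a.connection.maximum", "100")
    // credentials, as in the slide's --conf spark.hadoop.fs.s3a.access.key=...
    .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    // write straight to S3 instead of committing via rename;
    // com.example.DirectOutputCommitter is a placeholder class name
    .set("spark.hadoop.mapred.output.committer.class",
         "com.example.DirectOutputCommitter")
}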
  18. Fixing Spark, fixing OOM: ◈ spark.default.parallelism = cores * 3 ◈ spark_mb = system_ram_mb * 4 // 5 ◈ set("spark.akka.frameSize", "2047")
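Putting those three rules together, a sketch; the helper name and the executor-memory key are assumptions, and spark.akka.frameSize applies to Spark 1.x, where 2047 MB is the maximum:

import org.apache.spark.SparkConf

object OomTuning {
  def tunedConf(totalCores: Int, systemRamMb: Int): SparkConf = {
    val sparkMb = systemRamMb * 4 / 5 // leave ~20% of RAM for the OS and overhead
    new SparkConf()
      .set("spark.default.parallelism", (totalCores * 3).toString) // ~3 tasks per core
      .set("spark.executor.memory", s"${sparkMb}m")
      .set("spark.akka.frameSize", "2047") // Spark 1.x only
  }
}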
  19. Fixing Spark, fixing miscellaneous: ◈ Don’t force Kryo class registration ◈ Use bzip2 compression for input files
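A sketch combining both fixes; the paths are hypothetical. bzip2 pairs well with Spark because the codec is splittable, so even a single large .bz2 file can be read in parallel:

import org.apache.spark.{SparkConf, SparkContext}

object MiscTuning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("misc-tuning")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "false") // don't force class registration
    val sc = new SparkContext(conf)
    // Hadoop's codecs decompress .bz2 transparently on read.
    val lines = sc.textFile("s3a://my-bucket/input/*.bz2")
    println(lines.count())
    sc.stop()
  }
}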
  20. Our Ultimate Spark Recipe: see the Grammarly tech blog for more info.
  21. Use spot instances. Spot instance: 80% cheaper! Cheap but transient. Regular instance: expensive but safe.
  22. Was It All Worth It? ◈ We spent the same amount of money ◈ Further experiments will be cheaper ◈ You can save three months!
  23. Take-aways: ◈ Don’t reinvent the wheel ◈ New technology will eat a lot of time ◈ Don’t be afraid to dive into code ◈ Look at problems from various angles ◈ Use spot instances
  24. Thanks! Any questions? You can find me at aliaxey90 (at) gmail (dot) com. Read the full article in the Grammarly tech blog.
