Budapest Spark Meetup - Apache Spark @enbrite.ly

Budapest Spark Meetup - Apache Spark @enbrite.ly presentation held on
March 30, 2016.

The vision we all share at enbrite.ly is to create the next-generation decision support system for online advertising, one that combines the market's needs (anti-fraud, viewability, brand safety and traffic quality assurance) in a single platform. We do this by analyzing vast amounts of data to create value for our customers. In the last 6 months we built our ETL pipeline, the core component of our data platform, on Apache Spark. In this presentation I share the journey from the whiteboard designs to the maintenance of a TB-scale data pipeline, along with the lessons we learned and the ups and downs of using Spark at scale.

Published in: Data & Analytics

  1. 1. Apache Spark @enbrite.ly Budapest Spark Meetup March 30, 2016
  2. 2. Joe MÉSZÁROS software engineer @joemesz joemeszaros
  3. 3. Who are we? Our vision is to revolutionize the KPIs and metrics that the online advertising industry currently uses. With our products, Antifraud, Brandsafety and Viewability, we provide actionable data to our customers.
  4. 4. Agenda ● What do we do? ● How do we do it? - enbrite.ly data platform ● Real-world antifraud example ● Lessons learned + Spark at scale +/-
  5. 5. What do we do? [Diagram labels: DATA COLLECTION, DATA PROCESSING, ANALYZE, ANTI FRAUD, VIEWABILITY, BRAND SAFETY, REPORT + API]
  6. 6. How do we do it? DATA COLLECTION
  7. 7. How do we do it? DATA PROCESSING
  8. 8. Amazon EMR ● Most popular cloud service provider ● Amazon Big Data ecosystem ● Applications: Hadoop, Spark, Hive, …. ● Scaling is easy ● Do not trust the BIG guys (API problem) ● Spark application in EMR runs on YARN (cluster manager) For more information: https://aws.amazon.com/elasticmapreduce/
  9. 9. Tools we use: https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors. A workflow engine that helps you build complex data pipelines of batch jobs. Created by Spotify’s engineering team.
  10. 10. Your friendly plumber that sticks your Hadoop, Spark, … jobs together with simple dependency definitions and failure management.
  11. 11. # Minimal Luigi task: requires SomeOtherTask (assumed to be defined elsewhere) and writes a message to a local file.
        import luigi

        class SparkMeetupTask(luigi.Task):
            param = luigi.Parameter(default=42)

            def requires(self):
                return SomeOtherTask(self.param)

            def run(self):
                with self.output().open('w') as f:
                    f.write('Hello Spark meetup!')

            def output(self):
                return luigi.LocalTarget('/meetup/message')

        if __name__ == '__main__':
            luigi.run()
  12. 12. Web interface
  13. 13. Web interface
  14. 14. Let me tell you a short story...
  15. 15. Tools we created GABO LUIGI Luigi + enbrite.ly extensions = Gabo Luigi ● Dynamic task configuration + dependencies ● Reshaped web interface ● Reusable data pipeline templates ● Monitoring for each task
  16. 16. Tools we created GABO LUIGI
  17. 17. Tools we created GABO LUIGI We plan to release it into the wild and make it open source as part of Spotify’s Luigi! If you are interested, our doors are open :-)
  18. 18. Tools we created GABO MARATHON Motivation: Testing with large data sets and slow batch jobs is boring and wasteful!
  19. 19. Tools we created GABO MARATHON Graphite
  20. 20. Real world example You are fighting against robots and want to humanize the ad tech era. You have a simple idea for detecting bot traffic that will save the world. Let’s implement it!
  21. 21. Real world example THE IDEA: Analyse events that are too hasty and deviate from regular, humanlike profiles: too many clicks in a defined timeframe. INPUT: Load balancer access log files on S3 OUTPUT: Print invalid sessions
  22. 22. How to solve it? Step 1: convert access log files to events. Step 2: sessionize events. Step 3: detect too many clicks.
  23. 23. The way to the access log
        1. Click event attributes (created by the JS tracker): { "session_id": "spark_meetup_jsmmmoq", "timestamp": 1456080915621, "type": "click" }
        2. Base64-encoded event: eyJzZXNzaW9uX2lkIjoic3Bhcmtfb WVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=
        3. Access log format: TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
  24. 24. Step 1: log to event. Simplification: the log files are on local storage and contain only click events.
        SparkConf conf = new SparkConf().setAppName("LogToEvent");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);
        // 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
  25. 25. Step 1: log to event
        // Split into at most 4 fields so the quoted request stays in one piece.
        JavaRDD<String> rawUrls = rawEvents.map(l -> l.split("\\s+", 4)[3]);
        // "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
        JavaRDD<String> eventParameter = rawUrls
            .map(u -> parseUrl(u).get("event"));
        // eyJzZXNzaW9uX2lkIj…
        JavaRDD<String> base64Decoded = eventParameter
            .map(e -> new String(Base64.getDecoder().decode(e), StandardCharsets.UTF_8));
        // {"session_id": "spark_meetup_jsmmmoq",
        //  "timestamp": 1456080915621, "type": "click"}
        IoUtil.saveAsJsonGzipped(base64Decoded);
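The parsing in Step 1 can also be tried without a Spark cluster. The following is a minimal sketch in plain Java, assuming the access log format shown on the slides (TS CLIENT_IP STATUS "GET url") and an `event` parameter that carries unescaped standard Base64. The class `LogToEvent` and the method `decodeEvent` are hypothetical names, and the hand-rolled parameter lookup only stands in for the talk's `parseUrl` helper, whose implementation is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not from the talk: decodes the event carried by one log line.
final class LogToEvent {

    /** Returns the decoded JSON event of an access log line, or null if there is none. */
    static String decodeEvent(String logLine) {
        // Fields: TS CLIENT_IP STATUS "GET <url>"; a limit of 4 keeps the quoted request whole.
        String[] fields = logLine.trim().split("\\s+", 4);
        if (fields.length < 4) {
            return null;
        }
        String request = fields[3];
        int start = request.indexOf("event=");
        if (start < 0) {
            return null;
        }
        start += "event=".length();
        int end = start;
        // Assumption: the value ends at the next '&', closing quote, or whitespace.
        while (end < request.length() && "&\" \t".indexOf(request.charAt(end)) < 0) {
            end++;
        }
        byte[] decoded = Base64.getDecoder().decode(request.substring(start, end));
        return new String(decoded, StandardCharsets.UTF_8);
    }
}
```

Calling `LogToEvent.decodeEvent(line)` on a line that carries the full Base64 payload from the earlier slide should yield the JSON click event for session `spark_meetup_jsmmmoq`. A production version would likely need URL-decoding and error handling for malformed payloads.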
  26. 26. Step 2: event to session
        SparkConf conf = new SparkConf().setAppName("EventToSession");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
        JavaRDD<ClickEvent> clickEvents = jsonEvents
            .map(e -> readJsonObject(e));
        JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
            clickEvents.groupBy(e -> e.getSessionId());
        // The sessionizer returns one Session per group, so mapValues is used.
        JavaPairRDD<String, Session> sessions = groupedEvents
            .mapValues(sessionizer);
  27. 27. Step 2: event to session
        // Sessionizer: builds one Session from the time-ordered clicks of a group.
        public Session call(Iterable<ClickEvent> clickEvents) {
            List<ClickEvent> ordered = sortByTimestamp(clickEvents);
            Session session = new Session();
            for (ClickEvent event : ordered) {
                session.addClick(event);
            }
            return session;
        }
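  28. 28. Step 2: event to session
        // Serializable so Spark can ship sessions between executors and the driver.
        class Session implements Serializable {
            public boolean isBad = false;
            public List<Long> clickTimestamps = new ArrayList<>();
            public void addClick(ClickEvent e) {
                clickTimestamps.add(e.getTimestamp());
            }
            public void badify() {
                this.isBad = true;
            }
        }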
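The grouping and ordering that Step 2 performs can be sketched with plain Java collections, which is handy for unit-testing the logic on small inputs before running it on a cluster. `Click` and `PlainSessionizer` are hypothetical stand-ins for the `ClickEvent` and sessionizer types on the slides; this is an illustration under those assumptions, not the code used in the talk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the ClickEvent type on the slides.
final class Click {
    final String sessionId;
    final long timestamp;

    Click(String sessionId, long timestamp) {
        this.sessionId = sessionId;
        this.timestamp = timestamp;
    }
}

final class PlainSessionizer {

    /** Groups clicks by session id and sorts each session's timestamps in ascending order. */
    static Map<String, List<Long>> sessionize(List<Click> clicks) {
        // LinkedHashMap keeps sessions in first-seen order, which makes output predictable.
        Map<String, List<Long>> sessions = new LinkedHashMap<>();
        for (Click c : clicks) {
            sessions.computeIfAbsent(c.sessionId, id -> new ArrayList<>()).add(c.timestamp);
        }
        for (List<Long> timestamps : sessions.values()) {
            Collections.sort(timestamps);
        }
        return sessions;
    }
}
```

This mirrors `groupBy` followed by a per-group sort, but holds everything in memory on one machine, so it only suits small test inputs.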
  29. 29. Step 3: detect bad sessions
        JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);
        // Mark sessions with more clicks than THRESHOLD, keeping the Session type.
        JavaRDD<Session> markedSessions = sessions.map(s -> {
            if (s.clickTimestamps.size() > THRESHOLD) {
                s.badify();
            }
            return s;
        });
        JavaRDD<Session> badSessions = markedSessions
            .filter(s -> s.isBad);
        badSessions.collect().forEach(System.out::println);
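The Step 3 code flags a session by its total click count, while the idea slide speaks of too many clicks in a defined timeframe. One possible way to express the timeframe is a sliding window over the sorted click timestamps. This is a swapped-in technique for illustration, not the detection logic from the talk; `BurstDetector`, `windowMillis` and `maxClicks` are hypothetical names.

```java
import java.util.List;

// Hypothetical sliding-window variant of the "too many clicks" check.
final class BurstDetector {

    /**
     * Returns true if more than maxClicks timestamps fall inside any window of
     * windowMillis milliseconds. Expects timestamps sorted in ascending order.
     */
    static boolean hasBurst(List<Long> sortedTimestamps, long windowMillis, int maxClicks) {
        int left = 0;
        for (int right = 0; right < sortedTimestamps.size(); right++) {
            // Advance the left edge until the window spans at most windowMillis.
            while (sortedTimestamps.get(right) - sortedTimestamps.get(left) > windowMillis) {
                left++;
            }
            if (right - left + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }
}
```

Each timestamp enters and leaves the window at most once, so the check is linear in the number of clicks per session. In the Spark job it could replace the size comparison inside the `map` step, assuming the session's timestamps are already sorted.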
  30. 30. Congratulations! MISSION COMPLETED. You just saved the world with a simple idea in ~10 minutes.
  31. 31. Using Spark pros ● Sparking is fun, community, tools ● Easy to get started ● Language support: Python, Scala, Java, R ● Unified stack: batch, streaming, SQL, ML
  32. 32. Using Spark cons ● You need memory and more memory ● Distributed application, hard to debug ● Hard to optimize
  33. 33. Lessons learned ● Do not use the default config, always optimize! ● Eliminate technical debt + automate ● Failures happen, so use monitoring from the very first breath + fault-tolerant implementation ● Sparking is fun, but it is not a hammer for everything
  34. 34. Data platform future ● Would like to play with Redshift ● Change data format (Avro, Parquet, …) ● Would like to play with streaming ● Would like to play with Spark 2.0
  35. 35. WE ARE HIRING! working @exPrezi office, K9 check out the company in Forbes :-) amazing company culture BUT the real reason ….
  36. 36. WE ARE HIRING! … is our mood manager, Bigyó :)
  37. 37. Joe MÉSZÁROS software engineer joe@enbrite.ly @joemesz @enbritely joemeszaros enbritely THANK YOU! QUESTIONS?
