Spark in Action #CodeMania11

Chulalongkorn University

Take a look at Apache Spark from parallel processing perspective

Spark in Action
from parallel processing perspective
Natawut Nupairoj (ณัฐวุฒิ หนูไพโรจน์, "Ajarn Peng")
natawut.n@chula.ac.th
http://natawutn.wordpress.com
@natawutn

“I want people to see that age is no barrier to writing code.
Abroad, people are still coding when their hair has gone gray.”
said one of the Code Mania #11 organizers

Storage Requirements for 90 days = 39,000,000,000 events (6.5TB)
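
To put the figure above in perspective, the arithmetic can be worked through in a few lines. This is a rough sketch, assuming decimal terabytes (1 TB = 10^12 bytes) and a uniform event rate; the object and value names are made up for illustration.

```scala
object StorageEstimate {
  val events: Long = 39000000000L   // 39,000,000,000 events over the period (from the slide)
  val totalBytes: Double = 6.5e12   // 6.5 TB, assuming decimal terabytes
  val days: Int = 90

  // Average size of one stored event: roughly 167 bytes
  val bytesPerEvent: Double = totalBytes / events
  // Average daily volume: roughly 433 million events
  val eventsPerDay: Double = events.toDouble / days
  // Average sustained rate: roughly 5,000 events per second
  val eventsPerSecond: Double = eventsPerDay / 86400

  def main(args: Array[String]): Unit = {
    println(s"bytes per event:   ${math.round(bytesPerEvent)}")
    println(s"events per day:    ${math.round(eventsPerDay)}")
    println(s"events per second: ${math.round(eventsPerSecond)}")
  }
}
```

Each event is small; it is the sustained rate over 90 days that makes the data set large.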

Source: https://www.mssqltips.com/sqlservertip/3222/big-data-basics--part-5--introduction-to-mapreduce/
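
The MapReduce model introduced by the source above can be sketched on a single machine with ordinary Scala collections. This is only an illustration of the three phases (map, shuffle, reduce) using word count; it is not Hadoop code, and all names are hypothetical.

```scala
object MapReduceSketch {
  // Map phase: turn each input record into (key, value) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: bring together all values that share a key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Reduce phase: fold the values of each key into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => k -> vs.sum }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))
}
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data across the network; the shape of the computation is the same.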

Hadoop Shortcomings
× Too simplified
× Not interactive
× Pessimistic

Along comes Apache Spark!
• Open-source Big Data framework from UC Berkeley
• In-Memory Analytics
• > 800 contributors
• Largest Cluster = 8,000+ nodes (Tencent)

Source: M. Zaharia, “New Directions for Spark in 2015”, Spark Summit East 2015, 18 March 2015.

Action#1 – Wifi Radius Log
• Size = 336.6 MB
• Sample Data
2016-01-01T00:00:08.000+07:00 acs1121-cen59-1 CSCOacs_Passed_Authentications 0137798608 4 0 2016-01-01 00:00:08.198 +07:00 0113947311 5200 NOTICE Passed-Authentication: Authentication succeeded, ACSVersion=acs-5.6.0.22-B.225, ConfigVe...
2016-01-01T00:00:13.000+07:00 acs1121-cen59-1 CSCOacs_RADIUS_Accounting 0137798618 2 0 2016-01-01 00:00:13.644 +07:00 0113947536 3000 NOTICE Radius-Accounting: RADIUS Accounting start request, ACSV...
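
Each record in the sample starts with an ISO-8601 timestamp that carries a UTC offset. A minimal sketch of pulling the hour out of that first field with java.time is shown here; it assumes the first space-separated field is always the timestamp, and the object and method names are hypothetical.

```scala
import java.time.OffsetDateTime
import scala.util.Try

object RadiusTimestamp {
  // Parse the first space-separated field as an ISO-8601 offset timestamp
  // and return its hour of day, or None when the field is not a timestamp.
  def hourOf(line: String): Option[Int] =
    Try(OffsetDateTime.parse(line.split(" ")(0)).getHour).toOption
}
```

Returning an Option lets a job skip malformed lines instead of failing on them.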

Action#1 – Wifi Radius Log
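// Load the RADIUS log as an RDD of lines
val loglines = sc.textFile("wifi-radius.log")
// Keep only successful authentications
val successlogin = loglines.filter(line => line.contains("Authentication succeeded"))
// The first space-separated field is the ISO timestamp
val logintime = successlogin.map(line => line.split(" ")(0))
// Characters 11-12 of the timestamp are the hour; emit (hour, 1)
val logintimepair = logintime.map(ltime => (ltime.substring(11, 13), 1))
// Sum the counts per hour, then order by hour
val loginByHour = logintimepair.reduceByKey((x, y) => x + y).sortByKey()
// Bring the (at most 24) results to the driver and print them
loginByHour.collect().foreach(println)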
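
The same pipeline can be tried without a cluster by mirroring each step on a local Scala collection. This is a sketch for checking the logic on a handful of lines, not a replacement for the Spark job; groupBy plus a sum stands in for reduceByKey, and the object name is hypothetical.

```scala
object LoginByHourLocal {
  def loginByHour(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .filter(_.contains("Authentication succeeded"))  // same predicate as the Spark filter
      .map(_.split(" ")(0))                             // first field: the timestamp
      .map(ts => (ts.substring(11, 13), 1))             // characters 11-12 hold the hour
      .groupBy(_._1)                                    // local stand-in for the shuffle
      .map { case (hour, ones) => hour -> ones.map(_._2).sum } // what reduceByKey computes
      .toSeq
      .sortBy(_._1)                                     // stand-in for sortByKey()
}
```

Feeding it a few shortened, made-up lines in the format of the sample data is enough to confirm that only successful logins are counted and that they are grouped by hour.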

Action#2 – Thai Monitoring Corpus
• Size = 40 x 200 MB
• Sample Data
{"ngram": 8, "totallen": 39, "wordseg": "}อยาก}ไป~นอน}ที่~นี่}<s>}วิว}สวย}<s>}รอ}ชม}ค่า}<s>", "text": "อยากไปนอนที่นี่ วิวสวยยยย <img class=\"img-in-emotion\" title=\"อมยิ้ม02\" alt=\"อมยิ้ม02\" src=\"http://ptcdn.info/emoticons/smiley/smiley02.png\"/><br />\nรอชมค่าาาา", "timestamp": 1392286781000, "id": 21935913, "faillen": 0, "refid": "31650001"}
{"ngram": 6, "totallen": 26, "wordseg": "ตาม}มา~เที่ยว}ด้วย~กัน}ค่ะ}คุณ}วา}<s>}", …
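
The "timestamp" field in the sample looks like epoch milliseconds. A small sketch of turning it into a yyyy-MM-dd date with java.time follows. The choice of UTC is an assumption made here so the result is deterministic; the names are hypothetical.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object CorpusDate {
  // Formatter pinned to UTC (an assumption for this sketch).
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Convert epoch milliseconds to a calendar date string.
  def dateOf(timestampMillis: Long): String =
    fmt.format(Instant.ofEpochMilli(timestampMillis))
}
```

For the first sample record, CorpusDate.dateOf(1392286781000L) gives 2014-02-13. A different time zone can shift the date for events close to midnight.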

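Action#2 – Thai Monitoring Corpus
import org.apache.spark.sql.functions.from_unixtime

// Read every JSON file in the bucket into a DataFrame
val data = sqlContext.read.json("gs://tmc-data/*")
// "timestamp" is in milliseconds; derive a yyyy-MM-dd date column
val d = data.withColumn("date", from_unixtime(data("timestamp") / 1000, "yyyy-MM-dd"))

// Markup spans whose tag name starts with F/f, and runs of non-space ASCII
val failre = "<[Ff][^>]+>[^<]+</[Ff][^>]+>".r
val nonthaire = "[\u0021-\u007F]+".r

// Turn a wordseg string into space-separated Thai tokens
def prepareText(x: Any): String = {
  var text = x.asInstanceOf[String]
  text = text.replace("\n", " ")
  text = text.replace("<s>", " ")
  text = text.replace("~", "")
  text = text.replace("}", " ")
  text = failre.replaceAllIn(text, " ")
  nonthaire.replaceAllIn(text, " ")
}

val dTmp = d.select(d("date"), d("wordseg"))
// Going through .rdd keeps this working on Spark 2.x, where DataFrame.map needs an Encoder
val dataMap = dTmp.rdd.map(item => (item(0), prepareText(item(1))))
val baseGramRDD = dataMap.flatMapValues(v => v.split("( )+"))
// countByValue() is an action: it returns a local Map of ((date, gram), count)
val baseGramCountRDD = baseGramRDD.countByValue()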
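
The text cleaning and counting in this slide can be checked locally on a single wordseg string. The sketch below reuses the two regular expressions from the slide on plain Scala collections. It departs from the slide in one place: it drops empty tokens, because splitting a string that starts with a delimiter yields an empty first element. The object and method names are hypothetical.

```scala
object GramCountLocal {
  private val failRe = "<[Ff][^>]+>[^<]+</[Ff][^>]+>".r
  private val nonThaiRe = "[\u0021-\u007F]+".r

  // Same cleaning steps as prepareText in the slide, on a plain String.
  def prepareText(raw: String): String = {
    val step1 = raw.replace("\n", " ").replace("<s>", " ")
    val step2 = step1.replace("~", "").replace("}", " ")
    nonThaiRe.replaceAllIn(failRe.replaceAllIn(step2, " "), " ")
  }

  // Split on runs of spaces and drop empty tokens (a deviation from the slide).
  def tokens(wordseg: String): Seq[String] =
    prepareText(wordseg).split("( )+").filter(_.nonEmpty).toSeq

  // Count each distinct (date, gram) pair, mirroring countByValue() on the pair RDD.
  def countGrams(records: Seq[(String, String)]): Map[(String, String), Int] =
    records
      .flatMap { case (date, wordseg) => tokens(wordseg).map(g => (date, g)) }
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }
}
```

Because "~" is removed before "}" becomes a space, segments joined by "~" in the corpus end up as a single token.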

Source: P. Zecevic and M. Bonaci, Spark in Action, Manning Publications, Summer 2016 (est.)

Why do I like Spark?
• RDD: simple but elegant design
• Very consistent abstraction
• Extensible with minimal overhead
• One feature added to the base RDD benefits all other modules
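
The design points above can be illustrated with a toy abstraction. The sketch below is not Spark's implementation; it is a hypothetical MiniRDD that describes a dataset by how to compute it, composes transformations lazily, and lets an add-on module extend it from the outside through an implicit class, much as Spark attaches reduceByKey to pair RDDs.

```scala
object MiniRddSketch {
  // A toy stand-in for the RDD idea: a dataset is defined by how to compute it.
  trait MiniRDD[T] { self =>
    def compute(): Iterator[T]

    // Transformations only wrap the parent; nothing runs until an action is called.
    def map[U](f: T => U): MiniRDD[U] = new MiniRDD[U] {
      def compute(): Iterator[U] = self.compute().map(f)
    }
    def filter(p: T => Boolean): MiniRDD[T] = new MiniRDD[T] {
      def compute(): Iterator[T] = self.compute().filter(p)
    }

    // Action: evaluates the whole chain and returns the elements.
    def collect(): List[T] = compute().toList
  }

  def fromSeq[T](data: Seq[T]): MiniRDD[T] = new MiniRDD[T] {
    def compute(): Iterator[T] = data.iterator
  }

  // An add-on "module" for key-value datasets, built only on the base trait.
  // For brevity it returns a local Map instead of another MiniRDD.
  implicit class PairOps[K, V](rdd: MiniRDD[(K, V)]) {
    def reduceByKey(f: (V, V) => V): Map[K, V] =
      rdd.collect().groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  }
}
```

Anything added to the base trait, such as a new transformation, is immediately usable by PairOps and any other extension, which is the benefit the last bullet describes.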

Spark in Action #CodeMania11

1. Spark in Action from parallel processing perspective
3. = Parallel Programming Reborn
4. Very Small Big Data Example
5. Storage Requirements for 90 days
6. #HowDidWeGetHere
7. MapReduce
9. Hadoop Shortcomings
10. Along comes Apache Spark!
13. Let’s get to Action!
14. Action#1 – Wifi Radius Log (sample data)
15. Action#1 – Wifi Radius Log (code)
16. For Spark: RDD is everything!
20. Action#2 – Thai Monitoring Corpus (sample data)
21. Action#2 – Thai Monitoring Corpus (code)
23. Why do I like Spark?
24. Big Data + Spark = a match made in heaven (#คู่จิ้นฟินเวอร์)
