Python in Big Data
Ecosystem
Nicholas Lu (Chee Seng)
PyCon Malaysia 2017
About me:
Physics and Mathematics major. ETL developer for Warner Chappell. Glowing
passion for the yellow elephant (Hadoop) ecosystem. A pip and apt-get guy.
Uses vim and tab.
github.com/lucheeseng827
Why do we need Py
in the Tonne
A world of heavy JVM and low-level languages: performance vs
simplicity
Total volume of data is immense
RAM is getting cheaper
Less code = fewer errors = less development time
What’s SMACK
1. Intro to SMACK
➔ Spark
In-memory processing does make stuff
faster and more efficient.
➔ Kafka
How many types of straws are we using
to drain a water tank?
➔ Cassandra
Storing data across multiple computers
does make it faster.
➔ Mesos
Containerized environment for
ease of scalability and
management.
➔ Akka
High concurrency for better utilization
and more efficient processing
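The Cassandra point above (spreading storage over several machines) can be sketched with a toy hash partitioner. This is only an illustration in plain Python, not how Cassandra itself is implemented: the node names, the `node_for` and `distribute` helpers, and the sample rows are all hypothetical, and a simple hash-modulo rule stands in for a real token ring.

```python
import hashlib
from collections import defaultdict

# Hypothetical node names, for illustration only.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key, nodes=NODES):
    """Pick a node from a stable hash of the partition key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(rows, key_field):
    """Group rows by the node that would own each row's key."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(str(row[key_field]))].append(row)
    return dict(placement)


# Made-up sample data.
rows = [{"customer_id": i, "amount": i * 10} for i in range(9)]
placement = distribute(rows, "customer_id")

for node, owned in sorted(placement.items()):
    print(node, [r["customer_id"] for r in owned])
```

Because the same key always hashes to the same node, reads and writes for one key go to a single machine while different keys spread the load.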
Why SMACK
Python in everything
Python Package                         | SMACK     | Big Guy Developer
Kafka Confluent, Pykafka, Python-Kafka | Kafka     | Confluent
Pyspark                                | Spark     | DataBricks
Mesos-python                           | Mesos     | Mesosphere
Cassandra-driver                       | Cassandra | DataStax
Pykka                                  | Akka      | Unknown
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka
When do you SMACK?
Dealing with mixed volume and velocity
Doing ETL/ELT (fixed schedule, moving data around)
Prefer speedy micro-batches over classic batch processing (seconds vs
minutes)
Plan to add more features over time
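The micro-batch idea above can be sketched in plain Python. This is a simplified illustration, not Spark code: it cuts batches by record count, whereas a streaming engine would typically cut them by time interval. The `micro_batches` and `total_amount` helpers and the sample events are hypothetical.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def total_amount(batch):
    """Toy per-batch computation: sum the 'amount' field."""
    return sum(record["amount"] for record in batch)


# Made-up event stream; a generator, so records arrive one at a time.
events = ({"id": i, "amount": i} for i in range(10))

running_total = 0
batch_sizes = []
for batch in micro_batches(events, batch_size=4):
    batch_sizes.append(len(batch))
    running_total += total_amount(batch)  # result is updated after every small batch

print(batch_sizes, running_total)
```

The point of the design is that a result is available after each small batch, instead of only after the whole input has been read.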
Tools Needed
Python
Docker (Kafka, Cassandra)
Spark
Some big flat files
A computer with a decent amount of RAM
2. Flow
Sequence for the data processing flow
➔ Pipe them in
Show me the data.
➔ Collect and Subscribe
Customer data in channel 4 and
Finance in channel 2
➔ Process in Batch
Release the Kraken!
➔ Process On-The-Go
Near real time processing for higher
urgency
https://intellipaat.com/tutorial/cassandra-tutorial/brief-architecture-of-cassandra/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
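The flow above (pipe data in, collect it per channel, process in batch, process on the go) can be sketched with an in-memory stand-in for a broker. This is not a Kafka client and makes no claim about the demo's actual code: `ToyBroker`, the topic names, the `flag_large_payment` callback, and the messages are all hypothetical.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self._topics = defaultdict(list)       # topic -> stored messages
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        self._topics[topic].append(message)    # keep for later batch processing
        for callback in self._subscribers[topic]:
            callback(message)                  # process on the go

    def read_all(self, topic):
        return list(self._topics[topic])


broker = ToyBroker()
alerts = []


def flag_large_payment(message):
    """Near-real-time check, run as each finance message arrives."""
    if message["amount"] > 100:
        alerts.append(message["id"])


# Collect and subscribe: each kind of data gets its own channel.
broker.subscribe("finance", flag_large_payment)

# Pipe them in.
broker.publish("customer", {"id": 1, "name": "Ann"})
broker.publish("finance", {"id": 2, "amount": 50})
broker.publish("finance", {"id": 3, "amount": 250})

# Process in batch: read everything stored on a channel at once.
batch_total = sum(m["amount"] for m in broker.read_all("finance"))

print(alerts, batch_total)
```

The same published messages feed both paths: the subscriber reacts immediately to urgent cases, while the batch step works over everything stored so far.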
SMACK
SMACK time
Then, Marcos discovered
SMACK
His interest in Python was
completely revived.
He’s able to give every project a great
SMACK: projects that provide clients
fast analytics at scale.
What’s next?
Flink implementation
Apache Beam implementation
ML implementation
Implementation of Caching
DataFrames
SQL in Spark Streaming
DC/OS (multi-cloud tenancy)
Many more
Q&A
Thank you!
To help make this demo better,
please send feedback to
lu.cheeseng827@gmail.com


Editor's Notes

  • #3 Problem statement
  • #4 How does Python fare in the big data world? A world of heavy JVM and low-level languages: performance vs simplicity. Total volume of data is immense. RAM is getting cheaper.
  • #15 When you want to speed up scalable processing, handling a high bandwidth of logs and transactions
  • #19 Explain what is happening in the backend, from data collection,