Python in the Big Data Ecosystem by Nicholas Lu

Session on Python in the big data ecosystem by Nicholas Lu at PyCon APAC 2017

  1. Python in Big Data Ecosystem. Nicholas Lu (Chee Seng). PyCon Malaysia 2017
  2. About me: Physics and mathematics major. ETL developer for Warner Chappell. A glowing passion for the yellow-elephant (Hadoop) ecosystem. A pip and apt-get guy. Uses vim and tabs. github.com/lucheeseng827
  3. Why do we need Py in the tonne-of-data world? Heavy JVM and low-level languages trade simplicity for performance. The total amount of data is immense. RAM is getting cheaper. Less code = fewer errors = less development time.
  4. What's SMACK?
  5. 1. Intro to SMACK ➔ Spark: in-memory processing does make things faster and more efficient. ➔ Kafka: how many kinds of straws are we using to drain a water tank? ➔ Cassandra: storing data across multiple computers does make it faster.
  6. ➔ Mesos: a containerized environment for ease of scaling and management. ➔ Akka: high concurrency for better utilization and more effective processes. (A Pykka sketch follows below.)
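Pykka, the Akka-flavoured entry in the package table on a later slide, brings an actor model to Python. Below is a minimal sketch, assuming pip install pykka; the actor and the messages it handles are made up for illustration, not something from the talk.

```python
# Minimal Akka-style concurrency in Python with Pykka (pip install pykka).
# The actor and messages here are illustrative placeholders.
import pykka


class WordCounter(pykka.ThreadingActor):
    """Actor that keeps a running word count for the messages it receives."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def on_receive(self, message):
        # Messages are handled one at a time, so no locks are needed.
        self.total += len(message.get("text", "").split())
        return self.total


if __name__ == "__main__":
    counter = WordCounter.start()          # returns an ActorRef
    counter.tell({"text": "fire and forget"})
    total = counter.ask({"text": "ask blocks until the actor replies"})
    print("words so far:", total)
    counter.stop()
```

Because each actor processes one message at a time, shared state needs no explicit locking, which is the same concurrency argument the slide makes for Akka.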
  7. Why SMACK?
  8. Python in everything:
     Python package                           SMACK component   Big Guy Developer
     confluent-kafka, pykafka, kafka-python   Kafka             Confluent
     PySpark                                  Spark             Databricks
     mesos-python                             Mesos             Mesosphere
     cassandra-driver                         Cassandra         DataStax
     Pykka                                    Akka              Unknown
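As one example from that table, here is a minimal cassandra-driver sketch, assuming a single-node Cassandra reachable on localhost (for instance the Docker container from the tools slide); the keyspace, table, and column names are made up for illustration.

```python
# Minimal cassandra-driver usage (pip install cassandra-driver).
# Assumes a single-node Cassandra on localhost:9042, e.g. started via Docker.
# Keyspace, table and column names are illustrative placeholders.
from uuid import uuid4

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id uuid PRIMARY KEY,
        channel text,
        payload text
    )
""")

# Prepared statements are the idiomatic way to run repeated inserts.
insert = session.prepare("INSERT INTO events (id, channel, payload) VALUES (?, ?, ?)")
session.execute(insert, (uuid4(), "finance", "hello cassandra"))

for row in session.execute("SELECT channel, payload FROM events LIMIT 5"):
    print(row.channel, row.payload)

cluster.shutdown()
```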
  9. http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
  10. https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka
  11. When do you SMACK?
  12. You are dealing with mixed volume and velocity. You are doing ETL/ELT (fixed schedules, moving data around). You prefer speedy micro-batches over classic batch processing (seconds vs. minutes). You plan to add more features in the coming time. (See the PySpark sketch below.)
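A minimal PySpark sketch of the "micro-batch over classic batch" point: the same aggregation done once over a flat file, then recomputed every few seconds over a Kafka topic with Structured Streaming. The file path, topic name, broker address, and trigger interval are illustrative, and the streaming read additionally assumes the spark-sql-kafka package is on Spark's classpath.

```python
# Classic batch vs. micro-batch in PySpark. Paths, topic names and the
# trigger interval are placeholders; the Kafka source needs the
# spark-sql-kafka-0-10 package available to Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smack-demo").getOrCreate()

# Classic batch: read the whole flat file and aggregate once (minutes-scale).
batch_df = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)
batch_df.groupBy("channel").count().show()

# Micro-batch: a similar aggregation, recomputed every few seconds as new
# records arrive on a Kafka topic (seconds-scale).
stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
)
query = (
    stream_df.selectExpr("CAST(value AS STRING) AS value")
    .groupBy("value")
    .count()
    .writeStream
    .outputMode("complete")
    .trigger(processingTime="5 seconds")   # micro-batch interval
    .format("console")
    .start()
)
query.awaitTermination()
```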
  13. Tools needed: Python, Docker (Kafka, Cassandra), Spark, some big flat file (a sketch for generating one follows below), and a computer with a decent amount of RAM.
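If you do not have a big flat file handy, a throwaway script like the one below can generate one; the path, schema (channel, amount), and row count are arbitrary placeholders for the demo.

```python
# Generate a synthetic "big flat file" for the demo.
# Path, schema and row count are arbitrary placeholders.
import csv
import os
import random

CHANNELS = ["customer", "finance", "marketing", "ops"]

os.makedirs("data", exist_ok=True)
with open("data/transactions.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["id", "channel", "amount"])
    for i in range(1000000):
        writer.writerow([i, random.choice(CHANNELS), round(random.uniform(1, 500), 2)])
```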
  14. 2. Flow: sequence for the data-processing flow. ➔ Pipe them in: show me the data. ➔ Collect and subscribe: customer data in channel 4 and finance in channel 2. ➔ Process in batch: release the Kraken! ➔ Process on the go: near-real-time processing for higher urgency. (A kafka-python sketch of the first two steps follows below.)
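A minimal kafka-python sketch of "pipe them in" and "collect and subscribe": each record of the flat file is routed to a topic named after its domain, and a downstream consumer subscribes to just the finance channel. The broker address, topic names, and input file are illustrative placeholders.

```python
# "Pipe them in" and "Collect and subscribe" with kafka-python
# (pip install kafka-python). Broker address, topic names and the
# input file are illustrative placeholders.
import csv

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"

# Pipe them in: route each record to a topic ("channel") by its domain.
producer = KafkaProducer(bootstrap_servers=BROKERS)
with open("data/transactions.csv") as fh:
    for row in csv.DictReader(fh):
        topic = row["channel"]                      # e.g. "customer", "finance"
        producer.send(topic, row["amount"].encode("utf-8"))
producer.flush()

# Collect and subscribe: a downstream consumer reads only the finance channel.
consumer = KafkaConsumer(
    "finance",
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.value)
```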
  15. https://intellipaat.com/tutorial/cassandra-tutorial/brief-architecture-of-cassandra/
  16. https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
  17. SMACK
  18. SMACK time
  19. SMACK time
  20. SMACK time
  21. Then, Marcos discovered SMACK. His interest in Python was completely revived. He's able to give every project a great SMACK: a project that provides clients fast analytics at scale.
  22. What's next? Flink implementation. Apache Beam implementation. ML implementation. Caching DataFrames. SQL in Spark Streaming. DC/OS (multi-cloud tenancy). Many more.
  23. Q&A
  24. Thank you! To help make this demo better, please send feedback to lu.cheeseng827@gmail.com.
