Forwarded from Python | Machine Learning | Coding | R
📁 I've brought you 10 of the best portfolios from data science professionals, each of whom has followed a unique path! Check them out and get inspired to build a strong portfolio of your own! 👇
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #SupervisedLearning #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming
https://yangx.top/CodeProgrammer
👍7❤1
Forwarded from Python | Machine Learning | Coding | R
Machine Learning Glossary
Brief visual explanations of machine learning concepts with diagrams, code examples and links to resources for learning more.
Link: https://ml-cheatsheet.readthedocs.io/en/latest/index.html
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍6
Forwarded from Python | Machine Learning | Coding | R
The program covers #NLP, #CV, and #LLM topics, as well as applications of these technologies in medicine, offering a full training cycle from theory to hands-on classes using current versions of the libraries.
The course is designed to be accessible even to beginners: if you can take derivatives and multiply matrices, everything else is explained along the way.
The lectures are released for free on YouTube and the #MIT platform on Mondays, and the first one is already available.
All slides, #code and additional materials can be found at the link provided.
📌 Fresh lecture: https://youtu.be/alfdI7S6wCY?si=6682DD2LlFwmghew
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍10
Forwarded from Python | Machine Learning | Coding | R
Numpy @CodeProgrammer.pdf
813.2 KB
👨🏻💻 For the past few days, I've been busy preparing this comprehensive tutorial on the NumPy library for data science, covering its most useful tips and tricks.
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍9🔥4❤2
This real-world project tutorial covers zero-shot and few-shot prompting, delimiters, numbered steps, role prompts, chain-of-thought prompting, and more. Improve your LLM-assisted projects today.
Link: https://realpython.com/practical-prompt-engineering/
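As a quick, hedged illustration of the techniques the tutorial covers, here is a minimal sketch of a few-shot prompt that combines a role prompt, delimiters, and numbered steps; the template and example texts below are made up for illustration and are not taken from the article.

# Illustrative only: a few-shot prompt with a role, delimiters, and numbered steps.
FEW_SHOT_PROMPT = """You are a sentiment classifier.
Follow these steps:
1. Read the review between <review> tags.
2. Reason about the sentiment step by step.
3. Answer with one word: positive or negative.

Example:
<review>This phone died after two days.</review>
Answer: negative

<review>{review}</review>
Answer:"""

# Fill the placeholder with the user's text and send the result to whatever LLM client you use.
prompt = FEW_SHOT_PROMPT.format(review="Battery life is fantastic and the screen is gorgeous.")
print(prompt)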
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍6
Forwarded from Python | Machine Learning | Coding | R
👨🏻💻 "Where do I start now?" This was the first and biggest question I faced when I started my Data Science learning journey!
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍9
Forwarded from Python | Machine Learning | Coding | R
Course lecture on building Transformers from first principles:
https://www.dropbox.com/scl/fi/jhfgy8dnnvy5qq385tnms/lectureattentionneuralnetworks.pdf?rlkey=fddnkonsez76mf8bzider3hrv&dl=0
The #PyTorch notebooks also demonstrate how to implement #Transformers from scratch:
https://github.com/xbresson/CS52422025/tree/main/labslecture07
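For a taste of what such notebooks build up to, here is a minimal sketch of scaled dot-product attention in #PyTorch; it is a generic illustration, not code taken from the linked lecture materials.

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # attention weights over the sequence
    return torch.matmul(weights, v)           # weighted sum of the value vectors

q = k = v = torch.randn(1, 4, 8)              # toy batch: 1 sequence, 4 tokens, dim 8
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 4, 8])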
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍8
Forwarded from Python | Machine Learning | Coding | R
Pandas Introduction to Advanced.pdf
854.8 KB
👨🏻💻 You can't attend a #datascience interview and not be asked about Pandas! But you don't have to memorize all its methods and functions! With this booklet, you'll learn everything you need.
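As a quick flavor of the interview-staple operations such a booklet covers, here is a small, self-contained Pandas sketch; the toy data below is made up for illustration.

import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR"],
    "name": ["Alice", "Bob", "Cara", "Dan"],
    "salary": [90, 85, 70, 75],
})

# Typical interview asks: filtering, groupby aggregation, and sorting.
high_paid = df[df["salary"] > 80]
avg_by_dept = df.groupby("dept")["salary"].mean()
print(high_paid)
print(avg_by_dept.sort_values(ascending=False))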
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍13
Forwarded from Python | Machine Learning | Coding | R
Find these FREE AI Courses here 👇
https://www.mltut.com/best-resources-to-learn-artificial-intelligence/
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍8
Forwarded from Python | Machine Learning | Coding | R
Exercises in Machine Learning
This book contains 75+ exercises
Download, read, and practice:
arxiv.org/pdf/2206.13446
GitHub Repo: https://github.com/michaelgutmann/ml-pen-and-paper-exercises
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍9
Forwarded from Python | Machine Learning | Coding | R
Linear Algebra
The 2nd best book on linear algebra with ~1000 practice problems. A MUST for AI & Machine Learning.
Completely FREE.
Download it: https://www.cs.ox.ac.uk/files/12921/book.pdf
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍4❤2
#MachineLearning Systems — Principles and Practices of Engineering Artificially Intelligent Systems: https://mlsysbook.ai/
This open-source textbook focuses on how to design and implement AI systems effectively.
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/DataScienceM
❤5👍3
Forwarded from Python | Machine Learning | Coding | R
This book is for readers looking to learn new #machinelearning algorithms or understand algorithms at a deeper level. Specifically, it is intended for readers interested in seeing machine learning algorithms derived from start to finish. Seeing these derivations might help a reader previously unfamiliar with common algorithms understand how they work intuitively. Or, seeing these derivations might help a reader experienced in modeling understand how different #algorithms create the models they do and the advantages and disadvantages of each one.
This book will be most helpful for those with practice in basic modeling. It does not review best practices—such as feature engineering or balancing response variables—or discuss in depth when certain models are more appropriate than others. Instead, it focuses on the elements of those models.
https://dafriedman97.github.io/mlbook/content/introduction.html
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍4❤2
Forwarded from Python | Machine Learning | Coding | R
"Introduction to Probability for Data Science"
One of the best books on #Probability. Available FREE.
Download the book:
probability4datascience.com/download.html
#DataAnalytics #Python #SQL #RProgramming #DataScience #MachineLearning #DeepLearning #Statistics #DataVisualization #PowerBI #Tableau #LinearRegression #Probability #DataWrangling #Excel #AI #ArtificialIntelligence #BigData #DataAnalysis #NeuralNetworks #GAN #LearnDataScience #LLM #RAG #Mathematics #PythonProgramming #Keras
https://yangx.top/CodeProgrammer
👍7❤1
Forwarded from Python | Machine Learning | Coding | R
SciPy.pdf
206.4 KB
Unlock the full power of SciPy with my comprehensive cheat sheet!
Master essential functions for:
• Function optimization and solving equations
• Linear algebra operations
• ODE integration and statistical analysis
• Signal processing and spatial data manipulation
• Data clustering and distance computation
...and much more!
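To make a couple of those items concrete, here is a short illustrative SciPy sketch (toy inputs only) touching optimization, linear algebra, and statistics:

import numpy as np
from scipy import optimize, linalg, stats

# Function optimization: minimize a simple quadratic.
result = optimize.minimize(lambda x: (x[0] - 3) ** 2 + 1, x0=[0.0])
print("minimum at:", result.x)

# Linear algebra: solve Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print("solution:", linalg.solve(A, b))

# Statistics: summary of a random sample.
sample = stats.norm.rvs(loc=0, scale=1, size=1000, random_state=42)
print(stats.describe(sample))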
💯 BEST DATA SCIENCE CHANNELS ON TELEGRAM 🌟
#Python #SciPy #MachineLearning #DataScience #CheatSheet #ArtificialIntelligence #Optimization #LinearAlgebra #SignalProcessing #BigData
👍5
Forwarded from Python | Machine Learning | Coding | R
Numpy from basics to advanced.pdf
2.4 MB
NumPy is an essential library in the world of data science, widely recognized for its efficiency in numerical computations and data manipulation. This powerful tool simplifies complex operations with arrays, offering a faster and cleaner alternative to traditional Python lists and loops.
The "Mastering NumPy" booklet provides a comprehensive walkthrough—from array creation and indexing to mathematical/statistical operations and advanced topics like reshaping and stacking. All concepts are illustrated with clear, beginner-friendly examples, making it ideal for anyone aiming to boost their data handling skills.
#NumPy #Python #DataScience #MachineLearning #AI #BigData #DeepLearning #DataAnalysis
✉️ Our Telegram channels: https://yangx.top/addlist/0f6vfFbEMdAwODBk
📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
❤4👍1
Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts
---
### 1. What is PySpark?
PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.
PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:
• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations
---
### 2. PySpark Ecosystem Components
• Spark SQL – Structured data queries with DataFrame and SQL APIs
• Spark Core – Fundamental engine for task scheduling and memory management
• Spark Streaming – Real-time data processing
• MLlib – Machine learning at scale
• GraphX – Graph computation
---
### 3. Why PySpark over Pandas?
| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Language | Python | Python on JVM via Py4J |
| Learning Curve | Easier | Medium (Big Data focus) |
---
### 4. PySpark Setup in Local Machine
#### Install PySpark via pip:
pip install pyspark
#### Start PySpark Shell:
pyspark
#### Sample Code to Initialize SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("MyApp") \
.getOrCreate()
---
### 5. RDD vs DataFrame
| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |
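To make the table concrete, here is a small illustrative sketch handling the same toy data once as an RDD and once as a DataFrame; it assumes the spark session created in section 4:

# RDD: low-level, you spell out each transformation yourself.
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
adults = rdd.filter(lambda row: row[1] > 25).collect()
print(adults)  # [('Bob', 30)]

# DataFrame: high-level, optimized automatically by Catalyst.
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["Name", "Age"])
df.filter(df["Age"] > 25).show()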
---
### 6. Creating DataFrames
#### From Python List:
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
#### From CSV File:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
---
### 7. Inspecting DataFrames
df.printSchema() # Schema info
df.columns # List column names
df.describe().show() # Summary stats
df.head(5) # First 5 rows
---
### 8. Basic Transformations
df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()
---
### 9. Working with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()
---
### 10. Writing Data
df.write.csv("output.csv", header=True)
df.write.parquet("output_parquet/")
---
### 11. Summary of Concepts Covered
• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage
---
### Exercise
1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file (see the sketch below)
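A minimal sketch of one way to work through this exercise, assuming a hypothetical sample.csv with a numeric price column (adapt the names to your own file):

df = spark.read.csv("sample.csv", header=True, inferSchema=True)
df.printSchema()                                         # 1. display the schema

df = df.withColumn("price_with_tax", df["price"] * 1.2)  # 2. new calculated column
df = df.filter(df["price"] > 100)                        # 3. filter on a condition

df.write.parquet("filtered_output/")                     # 4. save as Parquet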
---
#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL
https://yangx.top/DataScienceM
❤4
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations
---
### 1. Column Operations
PySpark supports various column-wise operations using expressions.
#### Select Specific Columns:
df.select("Name", "Age").show()
#### Create/Modify Column:
from pyspark.sql.functions import col
df.withColumn("AgePlus5", col("Age") + 5).show()
#### Rename a Column:
df.withColumnRenamed("Age", "UserAge").show()
#### Drop Column:
df.drop("Age").show()
---
### 2. Filtering and Conditional Logic
#### Filter Rows:
df.filter(col("Age") > 25).show()
#### Multiple Conditions:
df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()
#### Using `when` for Conditional Columns:
from pyspark.sql.functions import when
df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()
---
### 3. Aggregations and Grouping
#### GroupBy + Aggregations:
df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()
#### Using Aggregate Functions:
from pyspark.sql.functions import avg, max, min, count
df.groupBy("Department").agg(
avg("Salary").alias("AvgSalary"),
max("Salary").alias("MaxSalary")
).show()
---
### 4. Sorting and Ordering
#### Sort by One or More Columns:
df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()
---
### 5. Dropping Duplicates & Handling Missing Data
#### Drop Duplicates:
df.dropDuplicates(["Name", "Age"]).show()
#### Drop Rows with Nulls:
df.dropna().show()
#### Fill Null Values:
df.fillna({"Salary": 0}).show()
---
### 6. Joins in PySpark
PySpark supports the same join types as SQL.
#### Types of Joins:
• inner
• left
• right
• outer
• left_semi
• left_anti
#### Example – Inner Join:
df1.join(df2, on="id", how="inner").show()
#### Left Join Example:
df1.join(df2, on="id", how="left").show()
---
### 7. Working with Dates and Timestamps
from pyspark.sql.functions import current_date, current_timestamp
df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()
#### Date Formatting:
from pyspark.sql.functions import date_format
df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()
---
### 8. Window Functions (Advanced Aggregations)
Used for operations like ranking, cumulative sum, and moving average.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()
---
### 9. Caching and Persistence
Use caching for performance when reusing data:
df.cache()
df.show()
Or use:
df.persist()
---
### 10. Summary of Concepts Covered
• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance
---
### Exercise
1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show the top-paid employees per department using window functions (see the sketch below)
4. Practice caching and observe performance
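For item 3, a minimal sketch building on the window-function example from section 8; it assumes df has Department and Salary columns:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank employees within each department by salary, highest first,
# then keep only the top-paid employee per department.
w = Window.partitionBy("Department").orderBy(col("Salary").desc())
df.withColumn("rank", row_number().over(w)).filter(col("rank") == 1).show()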
---
#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark
https://yangx.top/DataScienceM
❤2
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment
---
### 1. Working with UDFs (User Defined Functions)
UDFs allow custom Python functions to be used in PySpark transformations.
#### Define and Use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def label_age(age):
    return "Senior" if age > 50 else "Adult"
label_udf = udf(label_age, StringType())
df.withColumn("AgeGroup", label_udf(df["Age"])).show()
> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
---
### 2. Working with JSON and Parquet Files
#### Read JSON File:
df_json = spark.read.json("data.json")
df_json.show()
#### Read & Write Parquet File:
df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")
---
### 3. Using PySpark MLlib (Machine Learning Library)
MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.
---
#### Steps in a Typical ML Pipeline:
• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction
---
### 4. Example: Logistic Regression in PySpark
#### Step 1: Prepare Data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Sample DataFrame
data = spark.createDataFrame([
(1.0, 2.0, 3.0, 1.0),
(2.0, 3.0, 4.0, 0.0),
(1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])
# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)
#### Step 2: Train Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)
#### Step 3: Make Predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
---
### 5. Model Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
print("Accuracy:", evaluator.evaluate(predictions))
---
### 6. Save and Load Models
# Save
model.save("models/logistic_model")
# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")
---
### 7. PySpark with Pandas API on Spark
For small-to-medium data (pandas-compatible), use pyspark.pandas:
import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")
pdf.head()
> Works like Pandas, but with Spark backend.
---
### 8. Scheduling & Cluster Deployment
PySpark can run:
• Locally
• On YARN (Hadoop)
• Mesos
• Kubernetes
• In Databricks, AWS EMR, Google Cloud Dataproc
Use spark-submit for production scripts:
spark-submit my_script.py
---
### 9. Tuning and Optimization Tips
• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid using collect() on large datasets
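As a small illustration of the first and third tips, the sketch below caches a reused DataFrame and repartitions a skewed one; df is assumed to be an existing DataFrame:

# Cache a DataFrame that several later actions will reuse.
df.cache()
df.count()                      # first action materializes the cache

# Repartition by a key column to spread skewed data across the cluster.
balanced = df.repartition(8, "Department")
print(balanced.rdd.getNumPartitions())
---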
### 10. Summary of Part 3
• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques
---
### Exercise
1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No” (see the sketch below)
5. Deploy your pipeline using spark-submit
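For step 4, here is a minimal sketch that labels predictions with a UDF, assuming the predictions DataFrame from section 4 is available:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Map the numeric prediction to a readable Yes/No label.
yes_no = udf(lambda p: "Yes" if p == 1.0 else "No", StringType())
predictions.withColumn("label_text", yes_no(predictions["prediction"])).show()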
---
#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark
https://yangx.top/DataScienceM
❤3
🔥 Trending Repository: data-engineer-handbook
📝 Description: This is a repo with links to everything you'd ever want to learn about data engineering
🔗 Repository URL: https://github.com/DataExpert-io/data-engineer-handbook
📖 Readme: https://github.com/DataExpert-io/data-engineer-handbook#readme
📊 Statistics:
🌟 Stars: 36.3K stars
👀 Watchers: 429
🍴 Forks: 7K forks
💻 Programming Languages: Jupyter Notebook - Python - Makefile - Dockerfile - Shell
🏷️ Related Topics:
#data #awesome #sql #bigdata #dataengineering #apachespark
==================================
🧠 By: https://yangx.top/DataScienceM