Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

📄

"Pandas Introduction to Advanced" booklet

👨🏻‍💻 You can't attend a #datascience interview and not be asked about Pandas! But you don't have to memorize all its methods and functions! With this booklet, you'll learn everything you need.

✔️ One of the most useful and interesting combinations is using #Pandas with #AWS Lambda, which can be very useful in real projects.

Please open Telegram to view this post

VIEW IN TELEGRAM

👍13

4.46K views20:09

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

Please open Telegram to view this post

VIEW IN TELEGRAM

👍8

3.98K views08:19

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

Exercises in Machine Learning

This book contains 75+ exercises

Download, read, and practice:
arxiv.org/pdf/2206.13446

GitHub Repo: https://github.com/michaelgutmann/ml-pen-and-paper-exercises

👍9

3.03K views01:00

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

Linear Algebra

The 2nd best book on linear algebra with ~1000 practice problems. A MUST for AI & Machine Learning.

Completely FREE.

Download it: https://www.cs.ox.ac.uk/files/12921/book.pdf

Please open Telegram to view this post

VIEW IN TELEGRAM

Please open Telegram to view this post

VIEW IN TELEGRAM

👍4❤2

2.93K views05:44

Data Science Machine Learning Data Analysis

#MachineLearning Systems — Principles and Practices of Engineering Artificially Intelligent Systems: https://mlsysbook.ai/

open-source textbook focuses on how to design and implement AI systems effectively

Please open Telegram to view this post

VIEW IN TELEGRAM

❤5👍3

9.17K viewsedited 06:47

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

🔗 Machine Learning from Scratch by Danny Friedman

https://dafriedman97.github.io/mlbook/content/introduction.html

Please open Telegram to view this post

VIEW IN TELEGRAM

👍4❤2

3.6K views05:29

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

"Introduction to Probability for Data Science"

One of the best books on #Probability. Available FREE.

Download the book:
probability4datascience.com/download.html

Please open Telegram to view this post

VIEW IN TELEGRAM

Please open Telegram to view this post

VIEW IN TELEGRAM

👍7❤1

3.99K views20:50

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

SciPy.pdf

206.4 KB

Unlock the full power of SciPy with my comprehensive cheat sheet!
Master essential functions for:

Function optimization and solving equations

Linear algebra operations

ODE integration and statistical analysis

Signal processing and spatial data manipulation

Data clustering and distance computation ...and much more!

💯

BEST DATA SCIENCE CHANNELS ON TELEGRAM

🌟

Please open Telegram to view this post

VIEW IN TELEGRAM

👍5

3.02K views03:36

Data Science Machine Learning Data Analysis

Forwarded from Python | Machine Learning | Coding | R

Numpy from basics to advanced.pdf

2.4 MB

📕

Mastering NumPy – From Basics to Advanced

NumPy is an essential library in the world of data science, widely recognized for its efficiency in numerical computations and data manipulation. This powerful tool simplifies complex operations with arrays, offering a faster and cleaner alternative to traditional Python lists and loops.

The "Mastering NumPy" booklet provides a comprehensive walkthrough—from array creation and indexing to mathematical/statistical operations and advanced topics like reshaping and stacking. All concepts are illustrated with clear, beginner-friendly examples, making it ideal for anyone aiming to boost their data handling skills.

🌟

Join the communities:

Please open Telegram to view this post

VIEW IN TELEGRAM

❤4👍1

3.4K views20:37

Data Science Machine Learning Data Analysis

Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts

---

### 1. What is PySpark?

PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.

PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:

• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations

---

### 2. PySpark Ecosystem Components

• Spark SQL – Structured data queries with DataFrame and SQL APIs
• Spark Core – Fundamental engine for task scheduling and memory management
• Spark Streaming – Real-time data processing
• MLlib – Machine learning at scale
• GraphX – Graph computation

---

### 3. Why PySpark over Pandas?

| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Language | Python | Python on JVM via Py4J |
| Learning Curve | Easier | Medium (Big Data focus) |

---

### 4. PySpark Setup in Local Machine

#### Install PySpark via pip:

pip install pyspark

#### Start PySpark Shell:

pyspark

#### Sample Code to Initialize SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

---

### 5. RDD vs DataFrame

| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |

---

### 6. Creating DataFrames

#### From Python List:

data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

#### From CSV File:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()

---

### 7. Inspecting DataFrames

df.printSchema()     # Schema info  
df.columns           # List column names  
df.describe().show() # Summary stats  
df.head(5)           # First 5 rows

---

### 8. Basic Transformations

df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()

---

### 9. Working with SQL

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()

---

### 10. Writing Data

df.write.csv("output.csv", header=True)
df.write.parquet("output_parquet/")

---

### 11. Summary of Concepts Covered

• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage

---

### Exercise

1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file

---

#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL

https://yangx.top/DataScienceM

❤4

1.77K viewsedited 05:51

Data Science Machine Learning Data Analysis

Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations

---

### 1. Column Operations

PySpark supports various column-wise operations using expressions.

#### Select Specific Columns:

df.select("Name", "Age").show()

#### Create/Modify Column:

from pyspark.sql.functions import col

df.withColumn("AgePlus5", col("Age") + 5).show()

#### Rename a Column:

df.withColumnRenamed("Age", "UserAge").show()

#### Drop Column:

df.drop("Age").show()

---

### 2. Filtering and Conditional Logic

#### Filter Rows:

df.filter(col("Age") > 25).show()

#### Multiple Conditions:

df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()

#### Using `when` for Conditional Columns:

from pyspark.sql.functions import when

df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()

---

### 3. Aggregations and Grouping

#### GroupBy + Aggregations:

df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()

#### Using Aggregate Functions:

from pyspark.sql.functions import avg, max, min, count

df.groupBy("Department").agg(
    avg("Salary").alias("AvgSalary"),
    max("Salary").alias("MaxSalary")
).show()

---

### 4. Sorting and Ordering

#### Sort by One or More Columns:

df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()

---

### 5. Dropping Duplicates & Handling Missing Data

#### Drop Duplicates:

df.dropDuplicates(["Name", "Age"]).show()

#### Drop Rows with Nulls:

df.dropna().show()

#### Fill Null Values:

df.fillna({"Salary": 0}).show()

---

### 6. Joins in PySpark

PySpark supports various join types like SQL.

#### Types of Joins:

• inner
• left
• right
• outer
• left_semi
• left_anti

#### Example – Inner Join:

df1.join(df2, on="id", how="inner").show()

#### Left Join Example:

df1.join(df2, on="id", how="left").show()

---

### 7. Working with Dates and Timestamps

from pyspark.sql.functions import current_date, current_timestamp

df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()

#### Date Formatting:

from pyspark.sql.functions import date_format

df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()

---

### 8. Window Functions (Advanced Aggregations)

Used for operations like ranking, cumulative sum, and moving average.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()

---

### 9. Caching and Persistence

Use caching for performance when reusing data:

df.cache()
df.show()

Or use:

df.persist()

---

### 10. Summary of Concepts Covered

• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance

---

### Exercise

1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance

---

#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark

https://yangx.top/DataScienceM

❤2

1.29K views06:04

Data Science Machine Learning Data Analysis

Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment

---

### 1. Working with UDFs (User Defined Functions)

UDFs allow custom Python functions to be used in PySpark transformations.

#### Define and Use a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def label_age(age):
    return "Senior" if age > 50 else "Adult"

label_udf = udf(label_age, StringType())

df.withColumn("AgeGroup", label_udf(df["Age"])).show()

> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.

---

### 2. Working with JSON and Parquet Files

#### Read JSON File:

df_json = spark.read.json("data.json")
df_json.show()

#### Read & Write Parquet File:

df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")

---

### 3. Using PySpark MLlib (Machine Learning Library)

MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.

---

#### Steps in a Typical ML Pipeline:

• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction

---

### 4. Example: Logistic Regression in PySpark

#### Step 1: Prepare Data

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])

# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)

#### Step 2: Train Model

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)

#### Step 3: Make Predictions

predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()

---

### 5. Model Evaluation

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print("Accuracy:", evaluator.evaluate(predictions))

---

### 6. Save and Load Models

# Save
model.save("models/logistic_model")

# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")

---

### 7. PySpark with Pandas API on Spark

For small-medium data (pandas-compatible), use pyspark.pandas:

import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")
pdf.head()

> Works like Pandas, but with Spark backend.

---

### 8. Scheduling & Cluster Deployment

PySpark can run:

• Locally
• On YARN (Hadoop)
• Mesos
• Kubernetes
• In Databricks, AWS EMR, Google Cloud Dataproc

Use spark-submit for production scripts:

spark-submit my_script.py

---

### 9. Tuning and Optimization Tips

• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid using collect() on large datasets

---

### 10. Summary of Part 3

• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques

---

### Exercise

1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No”
5. Deploy your pipeline using spark-submit

---

#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark

https://yangx.top/DataScienceM

❤3

1.2K views09:06

Data Science Machine Learning Data Analysis

🔥 Trending Repository: data-engineer-handbook

📝 Description: This is a repo with links to everything you'd ever want to learn about data engineering

🔗 Repository URL: https://github.com/DataExpert-io/data-engineer-handbook

📖 Readme: https://github.com/DataExpert-io/data-engineer-handbook#readme

📊 Statistics:
🌟 Stars: 36.3K stars
👀 Watchers: 429
🍴 Forks: 7K forks

💻 Programming Languages: Jupyter Notebook - Python - Makefile - Dockerfile - Shell

🏷️ Related Topics:

==================================
🧠 By: https://yangx.top/DataScienceM

569 views11:27

📥 Download Zip

🚀 Explore Data Science

About

Blog

Apps

Platform